Introduction

Attention deficit hyperactivity disorder (ADHD) is one of the most frequently diagnosed neurobehavioral disorders in childhood, with a prevalence rate of approximately 5% [1]. The chronic nature of the disorder and its long-term impact on the social and academic life of affected individuals substantiate the need for early identification and treatment. Behavioral rating scales and interviews with parents and teachers are the most frequently used diagnostic tools in the assessment of ADHD [2,3,4], as the diagnostic criteria are of a behavioral nature [2, 5]. Informant ratings offer an efficient summary of natural observations over extended time spans [6]. A symptom of inattention is met if a person “is often easily distracted” or “often does not seem to listen when spoken to” (DSM-5 [7]). However, terms such as “often” and “easily” are not specifically defined and can therefore be subject to rater bias. Misinterpretation of items, inaccurate recall of events [5], halo effects [8], unknown bases of comparison of the informant [9], factors affecting the informant (e.g., maternal depression [6]), and the socioeconomic status of the target subject [10] may influence the validity of rating scales. The issue of subjectivity is particularly detrimental if ratings are used as an indicator of treatment response in unblinded pre-post intervention designs. Alternative methods for ADHD assessment with greater objectivity are therefore highly desirable. While it is also common to conduct laboratory psychological testing as part of the diagnostic process [2], the consensus view on the validity of such instruments suggests that there is no cognitive litmus test for the diagnosis of ADHD [2, 11, 12]. Pelham et al. [2] emphasized the importance of evaluating observable behavior instead of cognitive test performance for ADHD assessment.

A survey of school psychologists in the U.S. revealed that direct observations are among the most commonly used methodologies in diagnostic processes [13]. Handler and DuPaul [4] consider the use of observations in combination with other assessment methods to be consistent with standards of best practice. For treatment evaluation, blinded observations were claimed to be the gold standard of assessment [14]. The use of qualitative approaches to observation, with anecdotal descriptions of behavior, is widespread among practitioners [15], but these approaches do not allow for psychometric testing [16]. In contrast, systematic direct observation methods are based on standardized scoring procedures that aim to quantify operationally defined, specific behaviors in an objective way, enabling inter-observer agreement to be assessed [17]. These methods employ either a rating scale developed for the purpose of direct observations or a standardized recording strategy. Two typical recording strategies can be distinguished: continuous recording and time sampling of behavior. Continuous recording includes event counting (frequency) and duration recording. In time sampling, a target behavior is coded if it occurs during the whole predefined interval (whole-interval time sampling), at any time within the interval (partial-interval time sampling), or at a fixed moment of time (momentary time sampling) [18]. In ADHD, behavioral categories for observation usually consist of a proxy for attentive and inattentive behavior (i.e., on- and off-task behavior). Additionally, visually detectable aspects of motor activity and indicators of social interactions such as disruptiveness, aggression, and noncompliance constitute common variables in observational approaches.
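The three time-sampling rules described above can be illustrated with a minimal sketch. The function name `time_sample`, the per-second Boolean stream, and the choice of the last second of each interval as the "fixed moment" are assumptions made for illustration; they are not taken from any of the reviewed instruments.

```python
from typing import List


def time_sample(stream: List[bool], interval: int, method: str) -> List[bool]:
    """Code a per-second behavior stream into one score per interval.

    stream   -- True for each second in which the target behavior occurs
    interval -- interval length in seconds
    method   -- "whole", "partial", or "momentary"
    """
    scores = []
    for start in range(0, len(stream) - interval + 1, interval):
        window = stream[start:start + interval]
        if method == "whole":
            scores.append(all(window))   # behavior fills the entire interval
        elif method == "partial":
            scores.append(any(window))   # behavior occurs at any point
        elif method == "momentary":
            scores.append(window[-1])    # behavior present at the final second
        else:
            raise ValueError(f"unknown method: {method}")
    return scores


# Hypothetical 30 s of off-task behavior (seconds 4 to 15), coded in 10-s intervals.
stream = [False] * 4 + [True] * 12 + [False] * 14
whole = time_sample(stream, 10, "whole")
partial = time_sample(stream, 10, "partial")
momentary = time_sample(stream, 10, "momentary")
```

In this toy stream the behavior occupies 12 of 30 s (40%), whereas partial-interval coding scores 2 of 3 intervals, whole-interval coding 0 of 3, and momentary sampling 1 of 3. The example only shows that the three rules can yield different estimates from the same record.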

Systematic observations also represent a common method for diagnostics and treatment evaluation in autism spectrum disorder (ASD) [19]. These observations usually focus on variables describing social play and communication behaviors, challenging behaviors, and stereotypies (e.g., Autism Diagnostic Observation Schedule—ADOS [20]; Early Social Communication Scale—ESCS [21]). Children with conduct disorders (CD) are frequently observed during parent–child interactions (e.g., Parent–Child Interaction Task—PCIT [22]), peer interactions in the laboratory [23] and in the classroom (e.g., Multiple Option Observation System for Experimental Studies—MOOSES [24]). In general, observational methods are more commonly used for externalizing disorders than for internalizing disorders, owing to the overt nature of the behavioral problems. However, behavioral inhibition [25] or avoidance and fear (Anxiety Dimensional Observation Scale [26]) during mother–child interactions can be observational targets for anxiety disorders in children. Some comprehensive observational methods also include scales for internalizing problems (ASEBA-Direct Observation Form (DOF), Test Observation Form (TOF) [27, 28]).

Environments for observation can be roughly divided into naturalistic settings, such as the classroom, and standardized settings, such as the laboratory or clinic [5]. The high ecological validity of naturalistic settings [6] comes at the expense of uncontrollable contextual factors that might affect behavior. In standardized laboratory situations, by contrast, behavior is limited to a distinct given context, whereby the comparability of observed behavior between individuals and between multiple administrations is increased. Laboratory settings also allow for the application of less obtrusive observational methods through one-way mirrors or video cameras. Behavior in the laboratory may, however, be less generalizable due to the artificial nature of the situation.

To date, the psychometric properties and the diagnostic utility of standardized observations in ADHD have only been selectively delineated [2, 29]. A complete overview of their clinical validity in ADHD is lacking. Therefore, the purpose of this article is to comprehensively review the systematic observational instruments that have been used in studies on ADHD, published between 1990 and 2016. The employed tools were evaluated with respect to four clinically relevant issues:

  (1) Basic reliability measures of the methods are reported, namely inter-rater reliability (IRR) and test–retest reliability (TRR) for samples of ADHD subjects.

  (2) The predictive validity [2] of observations is discussed, i.e., to what extent such instruments can accurately distinguish between individuals with and without ADHD. The main emphasis is placed on reported classification rates, sensitivity and specificity (sensitivity refers to the ability of a measure to correctly identify cases, whereas specificity refers to the ability to correctly classify individuals without the problem in question).

  (3) Findings on convergent validity of observational measures are evaluated, i.e., correlations between observational data and other measures of ADHD (mostly parent and teacher ratings).

  (4) The evidence that behavioral observations detect treatment effects is reviewed.

For ease of reference, the numbering of these clinical issues will be retained and indicated accordingly in the corresponding sections of the review.
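The definitions of sensitivity and specificity given under issue (2) can be made concrete with a short sketch. The function name `classification_metrics` and the counts in the example are hypothetical and do not come from any reviewed study.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute classification indices from a 2x2 table.

    tp -- cases correctly identified as cases
    fn -- cases missed by the measure
    tn -- non-cases correctly classified as non-cases
    fp -- non-cases wrongly classified as cases
    """
    return {
        "sensitivity": tp / (tp + fn),                 # share of cases detected
        "specificity": tn / (tn + fp),                 # share of non-cases cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),   # overall correct classification
    }


# Hypothetical sample: 50 children with ADHD and 50 comparison children.
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
```

Note that the overall correct classification rate depends on the group sizes as well as on sensitivity and specificity, so rates from samples with different case-to-control ratios are not directly comparable.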

Method

To cover all relevant fields of ADHD research in which observational methods were applied, a search strategy ensuring wide coverage was implemented. Search terms included “ADHD or attention deficit hyperactivity disorder or ADD or attention deficit disorder” for subject field and “direct observ*” or “behavioral observ*” in any field (the asterisks served as wild cards). The initial search was extended by manual analysis of the reference lists of articles and by searches based on the names of observational instruments that were detected by the initial search (see Table 1 for an overview of the instrument names). Inclusion criteria were publication in English in a peer-reviewed journal from 1990 to 2016 and the administration of a systematic observational instrument in the study of individuals with a diagnosis of ADHD or symptoms of ADHD. Studies with fewer than ten subjects or with adult subjects were excluded. Objective measures of activity by mechanical or infrared devices (for review see [30, 31]) and aspects of language and private speech (for review see [32]) of individuals with ADHD were not covered. Behavioral measures in choice-impulsivity tasks were not considered a method of direct systematic observation (for review see [33]).

Table 1 Summary of the reliability and validity information for 29 observational tools in ADHD

Results

The database search generated 685 peer-reviewed articles using PsycINFO, PsycARTICLES, and Medline, finalized on July 6, 2016. Ninety-seven abstracts from the database search fulfilled the inclusion criteria. The comprehensive search including retrievals from reference lists and additional searches based on names of observational instruments yielded 179 studies for review. Studies applying a standardized observation unattached to a specific instrument were excluded from the review (n = 56). This resulted in 123 studies for review. Twelve studies comprised results of more than one observational tool (i.e., from different situations, e.g., classroom and playground). These were specified twice in tables with respect to the specific context. Hence, the tables contain 135 individual entries.
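The study counts in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative, and the sketch assumes that each of the twelve multi-tool studies is listed exactly twice in the tables, as the text states.

```python
# Counts taken from the paragraph above.
comprehensive_total = 179      # database search plus reference lists and instrument-name searches
excluded_no_instrument = 56    # standardized observation not tied to a specific instrument
multi_tool_studies = 12        # studies reporting more than one observational tool

# Studies retained for review.
studies_reviewed = comprehensive_total - excluded_no_instrument

# Each multi-tool study is assumed to contribute one additional table entry.
table_entries = studies_reviewed + multi_tool_studies
```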

Eighty-two studies reported systematic observations of children with ADHD in naturalistic settings, i.e., low-structured situations with few standardization attempts through the study protocol. These were separated into two major sections: classroom observations (n = 58) and observations of social interactions in natural contexts, e.g., group leisure activities or free play (n = 25).

Situations that were clearly predefined and specified by the study (e.g., room, group size, materials, instruction) were considered as laboratory (even if the observation occurred at home or in a separate room at school). In 52 studies, behavioral observations of children or adolescents with ADHD were conducted in such standardized, non-naturalistic settings. Tables were generated for different observational contexts (e.g., classroom observation, independent play, test session behavior). Within the tables, studies were sorted by study type, i.e., group discrimination, convergent validity, pharmacological and non-pharmacological interventions, and by year of publication. A separate section at the end of the tables displayed the studies with adolescents (n = 13).

The results narrative was structured according to the observational tools and the four research questions to be evaluated for each tool. In some cases, studies from before 1990 or non-ADHD studies were cited if no newer reports or no ADHD-specific studies were available to evaluate the respective issue. These studies were not listed in the tables.

Observation Studies of Children and Adolescents with ADHD in Naturalistic Settings

Systematic Classroom Observations

Fifty-four studies with classroom observations of children and four studies with observations of adolescents with ADHD conducted since 1990 were included in this review (Table 2). In 28 studies, the naturalistic concept was only partially applicable because behavior was investigated in a simulated school situation, i.e., a laboratory school or the classroom of a summer treatment program. In total, 11 different specific classroom observational instruments were applied in the study of ADHD.

Table 2 Classroom observation studies of children and adolescents with ADHD (n = 58)

Classroom Observation Code (COC)

The COC is an early, well-established observational system [91], which was applied in seven of the reviewed studies in Table 2. It assesses 3–12 variables in the classroom (interference, off-task behavior, noncompliance, motor activity, aggression, etc.) and applies a partial-interval time-sampling recording method (15-s intervals). (1) High rates of IRR were documented (phi = .80–1, kappa = .77–.94) [36, 48, 67]. In a modified version of the COC, TRR for 32 children with ADHD was highly significant at an interval of 1 day (r = .37–.72), but low at an interval of 2 days (r = .27–.49) [48]. (2) According to the COC categories of off-task behavior, interference, motor activity, and solicitation, 80% of cases of ADHD were correctly classified (false positive error of 9.8%) (in the original study [91]). All COC categories were exhibited at a significantly higher rate by children with ADHD compared to their typically developing classmates in the large sample of the Multimodal Treatment Study (MTA) [36]. (3) Small to moderate significant correlations were reported between observed negativistic behaviors (interference, noncompliance, aggression) and the Inattention and Overactivity with Aggression (IOWA) Conners teacher ratings of aggression (r = .37–.60) [48]. A correlation coefficient of r = .46 was reported between classroom off-task behavior and the IOWA inattention scale [48]. The COC categories correlated modestly (all r < .40) with performance on neuropsychological tasks [49]. (4) Three studies [52, 56, 67] reported significant improvement on the COC with pharmacological treatment.
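The IRR indices reported for interval-coded instruments such as the COC (percent agreement, kappa, phi) can all be derived from a 2x2 table of two observers' interval codes. The following is a minimal sketch; the function name `agreement_stats` and the counts are hypothetical and are not data from the cited studies.

```python
import math


def agreement_stats(a: int, b: int, c: int, d: int) -> dict:
    """Agreement indices for two observers coding the same intervals.

    a -- intervals both observers coded as occurrence
    b -- intervals only observer 1 coded as occurrence
    c -- intervals only observer 2 coded as occurrence
    d -- intervals both observers coded as non-occurrence
    """
    n = a + b + c + d
    p_obs = (a + d) / n
    # Agreement expected by chance from the observers' marginal rates.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return {"percent_agreement": p_obs, "kappa": kappa, "phi": phi}


# Hypothetical session of 100 coded intervals.
stats = agreement_stats(a=30, b=5, c=5, d=60)
```

Because kappa corrects for chance agreement, it is lower than raw percent agreement for the same data, which is worth keeping in mind when comparing IRR values reported in different metrics.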

ADHD School Observation Code (ADHD-SOC)

The ADHD-SOC [92] was developed on the basis of the COC [91]. It was used in one study of Table 2 [57]. The ADHD-SOC assesses interference, motor movement, noncompliance, aggression, and off-task behavior in a 15-s partial-interval time-sampling procedure. (1) IRR was acceptable (kappa = .57–.84) [57]. TRR was not specifically evaluated for the ADHD-SOC. (2) All classroom observational categories of the ADHD-SOC were shown to discriminate children with ADHD and comorbid tic disorder from controls on the group level. A combination of off-task behavior, interference, and noncompliance yielded correct identification of 91% of the subjects, but also misclassification of 20% of peers [57]. (3) There are no reports on the convergent validity of the ADHD-SOC. (4) The ADHD-SOC was sensitive to stimulant drug effects, with observed normalized classroom behavior in approximately 75% of children with ADHD and tic disorder [57].

Behavioral Observation of Students in Schools (BOSS)

The BOSS [93] separates on-task behavior into active (AET) and passive engagement (PET), while off-task behavior is divided into three subcategories: motor, verbal, and passive. Engagement is coded with a momentary time-sampling method (every 15 s), while off-task behavior is coded using a 15-s partial-interval time-sampling procedure. The BOSS was applied with different modifications in nine of the reviewed studies. (1) Adequate IRR was reached, with kappas ranging from .77 to .98 [37, 82]. The TRR was not investigated. Steiner et al. [85] noted a significant improvement in classroom off-task behavior over time in an untreated ADHD control group, which indicates some instability. A dependability study revealed that two 30-min observations on two separate days provided acceptable levels of dependability for progress monitoring purposes [94]. A third of the variance in BOSS on-task behavior within 30 min on 2 days was attributable to individual differences [94]. Single observations, even for a duration of 60 min, could not reach the same dependability as two 30-min observations on 2 days in the same academic subject [94]. (2) The rates of PET and off-task behavior significantly differentiated ADHD children from controls [37, 41, 47]. Based on a regression model, 71% of subjects were correctly classified into the groups of ADHD and peers by the BOSS categories of off-task behavior [41]. (3) BOSS off-task behavior was also a significant predictor of reading achievement in students with ADHD [37]. Inter-correlations between BOSS categories and the teacher AD/HD rating scale-IV were not significant (r = .02–.20) [37]. Hosterman et al. [10] reported moderate significant correlations between some BOSS categories and the teacher AD/HD rating scale-IV (r = .27–.40) and between some BOSS categories and the Conners Teacher Rating Scale (r = .25–.47). (4) Non-pharmacological intervention studies have shown significant behavioral improvement using the BOSS [82, 84, 85].

Direct Observation Form (DOF)

The DOF [27] is composed of a narrative part, a whole-interval sampling recording of on- and off-task behavior (5-s intervals), and an 89-item rating scale of problem behaviors to be completed for observations of 10 min. It assesses five syndrome scales and a DSM-oriented ADHD problem scale with subscales of inattention and hyperactivity-impulsivity. Table 2 includes four studies using the DOF. (1) Correlations between observers ranged from r = .69 to .57 [83] and from r = .97 to 1 [35]. The test–retest coefficients for the DOF scales and its on-task measure ranged between r = .25 and r = .77 (mean for problem scales r = .56) in a sample of 27 clinically referred children, as indicated in the instrument’s manual [27]. According to Volpe et al. [95], five 10-min DOF observations are required to reach acceptable generalizability and dependability for the DOF scales of ADHD problems and hyperactivity-impulsivity, whereas 11–14 observations are necessary for the sluggish cognitive tempo syndrome scale and the attention problem subscale. (2) ADHD subtype differences were demonstrated by the DOF [35]. Discriminant analyses based on DOF classroom observations revealed correct classification rates ranging from 61 to 67% for ADHD combined type versus clinically referred children without ADHD and normal controls, as well as 70% correct classification of ADHD inattentive type versus controls (no significant difference versus the non-ADHD referred clinical sample) [43]. (3) Regarding convergent validity, low to moderate correlations were found between the DOF ADHD scale and the parent AD/HD rating scale-IV (r = .09–.33) and between the DOF ADHD scale and the teacher AD/HD rating scale-IV (r = .21–.36). For two subscales of the DOF (oppositional and intrusive), some incremental validity was demonstrated, as indicated by 2–6% of additional variance accounted for in parent- and teacher-rated ADHD symptoms [51]. (4) As a treatment outcome variable, the DOF showed significantly lower levels of externalizing behavior in a treated group of ADHD preschoolers compared to untreated controls [83].

Classroom Observations for Conduct and Attention Deficit Disorder (COCADD)—Children

The COCADD [96] consists of 32 measures in five domains of classroom behavior (position, physical-social orientation, vocal activities, non-vocal activities, play), which are coded using a 2-s whole-interval sampling procedure. Since 1990, modified versions of the COCADD have been applied in six summer treatment program studies with ADHD children. (1) Kappa indices of IRR ranged between .42 and .78 [44] and .69 to .75 [54]. The TRR was not reported. (2) Teacher-identified students with ADHD were predicted in 83% of the cases (with 9% false positives and 24% false negatives) by using three variables of the COCADD (sitting, verbal intrusion, and talking to self) and three measures of desk checks and academic performance in the original study [96]. (3) COCADD overactive behavior correlated significantly with the IOWA Conners teacher rating of inattention-overactivity (r = .23) and COCADD verbal disruptive behavior with teacher-rated inattention-overactivity (r = .21) and aggression (r = .41) in the classroom in a sample of mixed ADHD/disruptive and unselected boys. Otherwise, no significant correlations emerged (e.g., the correlation between COCADD attending and inattention-overactivity was r = .02) [97]. (4) Sensitivity to pharmacological interventions was shown in the analogue classroom of several summer treatment program studies [53, 54, 60, 61] and a laboratory school study [62] for the modified version of the COCADD.

COCADD—Adolescents

Three studies employed the measures of off-task behavior and disruptive behavior of the COCADD in adolescents with ADHD. (1) IRR was low to adequate (phi = .84 [88]; kappa = .39–.83 [89]). No adolescent-specific TRR was reported, but off-task behavior seemed to vary considerably between different school subjects with different teachers (science versus math class; r = .25) [88]. (2) No group comparisons between adolescents with and without ADHD were conducted using the COCADD (no sensitivity/specificity analyses). (3) Student off-task behavior correlated moderately with the teacher AD/HD rating scale-IV (strongest correlation between total ADHD symptoms and on-task behavior r = − .27) [88]. (4) The COCADD variables of off-task and disruptive behavior proved to be sensitive to pharmacological interventions in the analogue junior high school lecture classroom in two summer treatment program studies [89, 90].

Swanson, Kotkin, Agler, M-Flynn, and Pelham Scale (SKAMP)

The 13 items of the SKAMP [98] are highly time- and situation-specific. The SKAMP is used to assess classroom-specific observable symptoms of inattention (e.g., getting started) and deportment (e.g., staying seated) over short time spans of 30–45 min [50]. It has been employed in the laboratory analogue classroom in 20 medication trial studies since 1990. (1) Döpfner et al. [68] found an IRR of r = .61 for the deportment scale and r = .74 for the inattention scale; otherwise, IRR was not reported for the SKAMP. TRR coefficients of the SKAMP were moderate to high (r = .63–.78) [58]. (2) In a large US sample of elementary school students, SKAMP teacher ratings did not predict later diagnosis of ADHD [99]. These SKAMP ratings were, however, based on the teachers’ observations over the previous 4 weeks and thus differ conceptually from direct observational ratings as administered in the laboratory classroom. (3) The SKAMP scales correlated moderately to strongly with the IOWA inattention-overactivity ratings (r = .50–.84), which were rated for the same observation periods by the same observer [58] (i.e., concurrent rather than convergent validity). Swanson et al. [50] reported small to moderate agreements (r = .21–.25) with parent SNAP-IV ratings. (4) Sensitivity to various pharmacological interventions has been repeatedly shown for the SKAMP [58, 59, 63, 66, 68, 69, 70, 71, 72, 73].

Student Observation System (SOS)

The SOS uses a 30-s momentary time-sampling method to assess adaptive behavior categories (e.g., responding to teacher) and maladaptive behavior categories (e.g., inattention, movement) [34]. It was applied in one study of Table 2 [34]. (1) IRR was acceptable (r = .69–1); TRR was not examined. (2) The category of inappropriate movement and the maladaptive behavior composite differed significantly between ADHD and controls, but discriminant analyses showed that the SOS failed to add information above and beyond that obtained by teacher ratings alone [34]. Convergent validity (3) and treatment sensitivity (4) were not examined for the SOS.

Hillside Behavior Rating Scale (HBRS)

An adapted version of the HBRS was applied in two studies [38, 39] of Table 2. It collected ratings of restlessness, noisiness, interactions, disturbance, frustration, and stimulation search (1) with adequate IRR (r = .70–.98) from videotaped classrooms [38]. TRR was not reported. (2) ADHD children displayed higher rates of behavior on all scales (except interactions) than typically developing peers; no ADHD prediction was calculated [38]. Convergent validity (3) and treatment sensitivity (4) were not examined for the HBRS in the classroom.

Ghent University Classroom Coding Inventory (GUCCI)

The GUCCI is a continuous sampling coding scheme for behaviors of activity, nonsocial vocalization, and social behavior [45] or time on-task [46] (applied in two studies of Table 2). (1) IRR was high (kappa = .74–.99) [45]. TRR was not reported. (2) Significant group differences were found, but no predictive analysis of ADHD was conducted. Convergent validity (3) and sensitivity to change (4) were not evaluated.

Munich Observation of Attention Inventory (MAI)

The MAI measures off- and on-task behavior with the use of a 5-s time-sampling procedure. It was applied in one study [42] (Table 2). (1) IRR and TRR were not assessed. (2) Children with ADHD differed significantly from controls by displaying more off-task behavior, but also initiating more on-task behavior. Passive inattention explained most variance in teacher ratings. Predictive validity was not assessed [42]. (3) Observed off-task behavior was moderately related to teacher DSM-III-R ADHD ratings (r = .41–.50) and inconspicuous on-task behavior (e.g., reading, writing) reached a correlation coefficient of r = − .71 with teacher ADHD ratings [42]. (4) No treatment evaluation study has applied the MAI.

Responses to Interpersonal and Physically Provoking Situations (RIPPS)

The RIPPS is a classroom observation schedule that was applied with ADHD adolescents in one study [87]. It records the student’s emotional responses and triggers. (1) The IRR was high (80%). TRR was not reported. (2) The RIPPS revealed higher rates of off-task and disruptive behavior in adolescents with ADHD than in controls [87]. Predictive validity, convergent validity (3) and treatment response (4) were not evaluated using the RIPPS.

Short Summary

Eleven different systematic tools for classroom observation were used in a total of 58 studies (Table 2: ADHD-SOC [n = 1], BOSS [n = 9], COC [n = 7], COCADD [n = 9], DOF [n = 4], GUCCI [n = 2], HBRS [n = 2], MAI [n = 1], RIPPS [n = 1], SKAMP [n = 21], SOS [n = 1]).

  • IRR: mostly acceptable (r = .61–1, phi = .60–1, kappa = .39–.99). The lowest Pearson r was reported for the SKAMP [68]; the lowest phi and kappa coefficients were reported for the COCADD [44, 62, 89]. IRR was not reported in 24 studies.

  • TRR: reported for two instruments (COC, SKAMP), ranging between r = .27 and .78 [48, 58]; between r = .25 and .77 on the DOF for clinically referred children [27].

  • Correct classification: ranged between 61% (DOF [43]) and 86% (ADHD-SOC [57]); analyzed for seven instruments (COC [91], ADHD-SOC [57], BOSS [41], DOF [43], COCADD [96], SKAMP [99], SOS [34]).

  • Convergent validity: reported in nine studies for six different tools (COC [48], BOSS [10, 37], DOF [51], COCADD [88, 97], SKAMP [50, 58], MAI [42]); poor agreements with parent ratings (r = .09–.25), moderate to occasionally strong agreements with teacher ratings (r = .21–.93), moderate agreements with neuropsychological tests (r = .26–.40).

  • Treatment outcome: significant pharmacological intervention effects were found using the ADHD-SOC (n = 1), COC (n = 4), COCADD (n = 7), and the SKAMP (n = 20); significant effects of non-pharmacological interventions were found using the BOSS (n = 4) and DOF (n = 1).
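To put the correlation coefficients summarized above into perspective, the following sketch computes a Pearson r and converts a coefficient into the proportion of shared variance (r squared). The function names and the example data are hypothetical and serve only as an illustration.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shared_variance(r: float) -> float:
    """Proportion of variance two measures have in common."""
    return r ** 2


# Hypothetical data: observed off-task intervals (%) and teacher rating scores.
off_task = [10.0, 25.0, 40.0, 15.0, 30.0]
teacher_rating = [12.0, 20.0, 30.0, 18.0, 22.0]
r_example = pearson_r(off_task, teacher_rating)

# Lower and upper bounds of the teacher-rating correlations in the summary.
low, high = shared_variance(0.21), shared_variance(0.93)
```

By this conversion, r = .21 corresponds to roughly 4% shared variance and r = .93 to roughly 86%, so the reported range spans very different degrees of overlap between observations and ratings.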

Observations in Naturalistic Social Interaction Settings

An overview of social interaction observation studies (n = 25) is given in Table 3. Six different specific observational instruments were employed:

Table 3 Social interaction observation studies of children with ADHD (n = 25)

Summer Research Program Observations

Six studies of Table 3 applied all-day observational schedules during summer research programs involving two (i.e., noncompliance, aggression) to five (i.e., prosocial behavior, social isolation, nonsocial behavior) variables to measure social interactions during different activities by a 5-s whole-interval sampling procedure. (1) IRR did not reach adequate levels in some of the variables (e.g., for prosocial behavior kappa = .31 [100], nonsocial behavior kappa = .40 [108]). Better agreement was reported for noncompliance (kappa = .65) [103] and aggression (kappa = .73) [108]. TRR was not reported. (2) Four summer research program studies reported higher rates of noncompliant and aggressive behavior in ADHD children than in comparison children [100,101,102,103]. Sensitivity and specificity were not evaluated. (3) Correlations between observed aggression and continuous performance task (CPT) scores were moderate but significant (r = .38) [108]. Observed aggression was also significantly correlated with mother-rated externalizing problems on the Child Behavior Checklist (CBCL). Noncompliance and nonsocial behavior revealed no significant associations with parent ratings [108]. (4) Medication had a significant attenuating effect on observed noncompliance and aggression [108].

Early Screening Project (ESP)

The play behavior of young children with ADHD was observed in preschools with the observation component of the ESP [121] in four studies of Table 3. The code uses a partial-interval, a whole-interval, and a momentary-interval time-sampling system with 15-s intervals. (1) The ESP allows different aspects of positive (e.g., positive social engagement) and negative social interactions (e.g., disruptive behavior) to be recorded and has shown adequate IRR (kappas = .81–.93) [106, 107, 109]. (2) During unstructured free play, preschoolers with ADHD displayed significantly more negative social behavior than typically developing children [104]. ADHD subtype and comorbidity did not lead to significant differences on the ESP [106, 107]. Predictive validity was not examined. (3) Teacher ratings on the Social Skills Rating System in children at risk of ADHD correlated weakly (r < .30) with observed ESP solitary play and aggression [109]. (4) The ESP was not used for treatment evaluation.

Coder Observation of Child Adaptation-Revised (COCA-R)

The COCA-R is a preschool observational instrument in a rating scale format. It was applied in one study of Table 3 [120]. (1) Observers achieved adequate IRR (r = .87–.93) on the COCA-R. TRR was not reported. (2) COCA-R scores of an ADHD sample were not compared to healthy controls; no discriminant analyses were conducted. (3) Correlations with the Conners Teacher Rating Scale were moderate and significant (r = .26–.39). (4) Combined parent and child training induced a significant improvement on the COCA-R social contact scale compared to an ADHD waitlist group [120].

Code for Observing Social Activity (COSA) and ADHD-SOC

In the playground and in the lunchroom, observations of children with ADHD were conducted with the COSA in three studies and with its precursor—the ADHD-SOC—in one study of Table 3. Both codes use a 15-s partial-interval time-sampling procedure to record aggression, noncompliance, and appropriate social interactions (30-s intervals in [52]). (1) These observations yielded low to adequate IRR (kappa = .57–.94) [48, 57]. For lunchroom measures, the TRR was almost entirely non-significant at an interval of 1 day but stronger at an interval of 2 days (r = .35–.68). Playground behavior of children with ADHD was highly unstable [48]. (2) Children with ADHD and comorbid tic disorder exhibited higher levels of observed aggressive and noncompliant behavior in the lunchroom and higher levels of physical aggression in the playground than classmates. Specificity and sensitivity analyses of the ADHD-SOC lunchroom and playground behavior were not conducted [57]. (3) Observed aggressive behavior in the lunchroom was significantly correlated with aggression in the IOWA Conners Teacher Rating Scale (r = .41–.66). The IOWA rating of inattention-overactivity was negatively correlated with playground physical aggression (r = − .52) [48]. (4) Significant reductions in aggression in the playground and in the lunchroom with stimulant medication were repeatedly reported [52, 55, 57].

Response-Cost Systems

All-day response-cost systems target directly observable behaviors. These systems produce a frequency count of undesirable behaviors across daily classroom and recreational periods. Ten studies of Table 3 included such point systems, which assessed the frequency of 3–12 variables (e.g., noncompliance, rule following, negative verbalizations). (1) Agreement between raters seemed rather variable (e.g., r = .44–.96 [115]) and TRR is unknown. (2) The predictive validity and (3) the convergent validity were not examined. (4) This instrument was found to be sensitive to various pharmacological [110, 111, 114] and behavioral treatments [118, 119].

Short Summary

Six different systematic tools for naturalistic social interaction observations were used in a total of 25 studies (Table 3: ADHD-SOC [n = 1], response-cost systems [n = 10], COCA-R [n = 1], COSA [n = 3], ESP [n = 4], Summer Research Program Observations [n = 6]).

  • IRR: mostly acceptable (r = .61–.99, ICC = .87–.95, kappa = .30–1). The lowest Pearson r was reported for response-cost systems [119]; the lowest kappa was reported for Summer Research Program Observations [100]. Not reported in three studies [110, 114, 118].

  • TRR: reported for one instrument (COSA playground and lunchroom observations) [48], ranging between ICC = −.20 and .24.

  • Correct classification: not examined beyond significant differences on the group level (Summer Research Program Observations, ESP, COSA).

  • Convergent validity: reported in four studies for four instruments (COSA [48], COCA-R [120], ESP [109], Summer Research Program Observations [108]); small to moderate agreements with parent ratings (r = .07–.40), teacher ratings (r = .20–.66), and CPT scores (r = .13–.38).

  • Treatment outcome: significant pharmacological intervention effects were found using the ADHD-SOC (n = 1), the COSA (n = 1), and response-cost systems (n = 7); significant effects of non-pharmacological interventions were found using response-cost systems (n = 3) and the COCA-R (n = 1).

Observation Studies of Children and Adolescents with ADHD in Laboratory Settings

Independent Play Observations

Since 1990, nine studies with ADHD children employing a specific observational tool during independent play have been published (Table 4).

Table 4 Independent play observation studies of children with ADHD (n = 9)

Structured Observation of Academic and Play Settings (SOAPS)

Different versions of the instrument SOAPS [131] were applied in six of the reported studies. Its original version consists of a free and a restricted 15-min play session, in which the duration of on-task behavior, fidgeting, out-of-seat, vocalizing, and the number of task shifts and position changes (i.e., floor grid crossings) are recorded [122]. SOAPS behaviors were time-sampled by the use of a 10-s partial-interval method [122]. Continuous sampling of the duration of behavior (i.e., off-task) and the frequency of behaviors (i.e., grid crossings) was conducted from videotapes [123, 124]. (1) Acceptable IRR was reached (kappas = .73–.99 [124]; 85–99% agreement [127]). Significant long-term stability was reported among a sample of clinic-referred boys for the playroom measures of position changes, on-task behavior, out-of-seat, and vocalizations over 2 years (r = .40–.52 [132]). TRR was otherwise not assessed. (2) Roberts [122] reported 64 and 58% correct classifications of hyperactive, aggressive, and hyperactive and aggressive boys in free play and restricted play, respectively. The original SOAPS was later modified for use with preschoolers. The addition of “forbidden” toys to the playroom differentiated preschoolers with and without ADHD quite strongly [124], but this effect could not be replicated [126]. Seventy percent of children with mental retardation and ADHD were classified correctly by the SOAPS as cases [125]. Children of the ADHD inattentive type could not be discriminated from controls based on their playroom behavior [123]. (3) Convergent validity was not assessed. (4) Significant effects of MPH were reported for the SOAPS free play behavior in ADHD children with mental retardation (e.g., fewer vocalizations, less movement) [127].

Index of Attention/Engagement

Three intervention studies [128,129,130] applied the observational measure of the index of attention/engagement while children played with a standardized toy. To calculate the index, the observed time on-task was divided by the number of attention switches. The higher the index, the more attention and the less switching were displayed. (1) Acceptable IRR (r = .76–.91) [128, 129] and a high TRR coefficient (r = .81) were reported [128]. Another, much lower TRR score of .54 was reported in a waitlist ADHD group of 19 subjects [129]. (2) Preschoolers with ADHD had a significantly lower index of attention/engagement than preschoolers without ADHD [128]. Specificity and sensitivity of the index were not evaluated. (3) Convergent validity was not assessed. (4) One study [128] revealed a significantly less pronounced decrease on the attention/engagement index in the treatment group than in the control group. Otherwise, no treatment-related changes were reported [129, 130].
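The index computation described above can be illustrated with a minimal Python sketch. The function name, the choice of seconds as the time unit, and the handling of a session without any attention switch are assumptions made for illustration only; they are not taken from the reviewed studies.

```python
def attention_engagement_index(on_task_seconds: float, attention_switches: int) -> float:
    """Observed on-task time divided by the number of attention switches.

    Assumption (not from the reviewed studies): with zero switches the
    ratio is undefined, so this sketch raises instead of guessing a value.
    """
    if on_task_seconds < 0 or attention_switches < 0:
        raise ValueError("inputs must be non-negative")
    if attention_switches == 0:
        raise ValueError("index is undefined without attention switches")
    return on_task_seconds / attention_switches


# Hypothetical sessions with identical on-task time but different switching.
focused = attention_engagement_index(240.0, 4)
distractible = attention_engagement_index(240.0, 12)
```

With the same on-task time, the session with fewer switches receives the higher index, which matches the interpretation given in the text (more attention, less switching).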

Short Summary

Two different systematic tools for independent play observations were used in a total of nine studies (Table 4: Index of attention/engagement [n = 3], SOAPS [n = 6]).

  • IRR: good (all coefficients > .70). Not reported in two studies [123, 130].

  • TRR: reported for one instrument (index of attention/engagement) [128, 129], ranging between r = .49 and .81.

  • Correct classification: ranged between 58 and 70% on the SOAPS [122, 125].

  • Convergent validity: not examined.

  • Treatment outcome: significant pharmacological intervention effects were found using the SOAPS [127]; no significant effects of non-pharmacological interventions were detected using the index of attention/engagement [128,129,130].

Test Session Behavioral Observations

Table 5 displays 27 studies in which children’s or adolescents’ test or task behavior was assessed using an observational instrument.

Table 5 Test session observation studies of children and adolescents with ADHD (n = 27)

Restricted Academic Situation (RAS)—Children

The RAS was implemented in 23 of the reviewed studies from Table 5. Originally, the RAS was an extension of the free and restricted play observations SOAPS [131, 153]. Individuals perform written academic math problems in playroom surroundings for 15 min as a laboratory analogue to classroom seatwork. A time-sampling strategy is applied to record the occurrence of usually five behavioral categories within 30-s intervals: off-task behavior, out-of-seat, fidgeting, vocalizing behavior, and object play (hereafter referred to as RAS measures). These same variables and methodology have also been applied to observe behavior during CPTs [133, 137]. (1) Acceptable IRR was reached for the RAS (e.g., ICC = .97–.99 [147], kappa = .86–1 [57]). Significant TRR in school-aged children with ADHD was reported by Karama et al. [147] (factor task disengagement r = .67; factor motor activity r = .61). An earlier study reported a TRR coefficient of r = .86 for RAS total ADHD behavior [153]. (2) The proportion of time on-task of the RAS most effectively separated hyperactive from aggressive children (86%) [122]. A correct classification rate of 64% was reported for children with mental retardation and ADHD by the RAS [125]. However, consistent evidence of discriminatory power for this paradigm is missing, as it was not possible to significantly distinguish between girls with and without ADHD [154], and another study failed to find significant between-group differences in off-task behavior between ADHD children and healthy controls during academic seatwork [133]. Findings are inconsistent regarding subtype differences [123, 137]. (3) Correlations between the RAS behavioral codes and CPT omission and commission errors were low to moderate (r = .26–.34) [6]. Pliszka [134] reported a significant correlation of r = .39 between CPT commission errors and RAS total score. Total ADHD behavior correlated significantly with parent hyperactivity ratings on the CBCL (r = .28), while observed off-task behavior correlated significantly with inattention on the Child Attention Problems Inattention rating scale for teachers (r = .28) and on the Conners Teacher Rating Scale (r = .26) in the same sample. (4) Significant positive treatment outcome was repeatedly shown in the RAS measures [127, 140,141,142,143,144,145,146,147,148,149], also when the same observational categories were applied in the regular classroom [155, 156].
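The 30-s time-sampling strategy of the RAS can be sketched in Python as follows. The sketch assumes partial-interval scoring (an interval is coded if the behavior occurs at any time within it), a 15-min session, and hypothetical behavior episodes given as (start, end) times in seconds; the function name and the input format are illustrative and not part of the RAS protocol.

```python
def partial_interval_scores(episodes, session_seconds=900, interval_seconds=30):
    """Return one 0/1 score per interval (1 = behavior occurred in it).

    `episodes` is a list of (start, end) times in seconds. Assumption:
    every episode has a positive duration; zero-length events are skipped.
    """
    n_intervals = session_seconds // interval_seconds
    scores = [0] * n_intervals
    for start, end in episodes:
        # Clip the episode to the observed session.
        start = max(0.0, start)
        end = min(float(session_seconds), end)
        if end <= start:
            continue
        first = int(start // interval_seconds)
        # An episode ending exactly on a boundary does not touch the next interval.
        last = int((end - 1e-9) // interval_seconds)
        for i in range(first, min(last, n_intervals - 1) + 1):
            scores[i] = 1
    return scores


# Hypothetical off-task episodes during a 15-min seatwork session.
off_task = [(10, 20), (55, 65), (890, 900)]
scores = partial_interval_scores(off_task)
proportion_coded = sum(scores) / len(scores)
```

In this hypothetical session, 4 of the 30 intervals are coded, although the behavior lasted only 30 s in total. Partial-interval scores therefore reflect how widely a behavior is spread across the session rather than its exact duration.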

RAS—Adolescents

The RAS was adapted for adolescents by adding distracting music to the playroom. Four studies with adolescent participants are shown in Table 5. (1) Adequate IRR was reached [150, 152]. Adolescent-specific TRR was not evaluated. (2) Adolescents with ADHD were successfully discriminated from healthy controls by all RAS measures [150], although not consistently [151]. An age-related decline was found in most observational variables [150]. Compared to a clinical control group without ADHD, adolescents with ADHD were not found to display higher scores on the RAS [152]. (3) In the same study, no significant correlations between the RAS measures and other diagnostic instruments were found [152]. However, Barkley [6] reported low to moderate correlations (r = .26–.36) between the impulsive-hyperactive factor of the Conners parent rating scale and RAS measures in a mixed sample of adolescents with and without ADHD. (4) Medication functioned as a significant covariate in between-group comparisons, which suggests some sensitivity to pharmacological treatment for the adolescent RAS [152].

Guide to Assessment of Test Session Behavior (GATSB)

The GATSB [157] is a normed 29-item rating scale that is completed by examiners after the administration of intelligence tests. It yields scores on the subjects’ avoidance, inattentiveness, and uncooperative mood during testing and was applied in one study of Table 5 [135]. (1) Reliability was not evaluated. (2) Classification analysis based on the GATSB revealed a hit rate of 81%, sensitivity of 88%, and specificity of 76% for differentiating children with ADHD hyperactive-impulsive type from non-ADHD controls [135]. (3) Convergent validity and (4) treatment sensitivity were not assessed.
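For readers less familiar with the classification indices reported above, the following Python sketch shows how hit rate, sensitivity, and specificity are derived from a two-by-two classification table. The counts are hypothetical and chosen for easy arithmetic; they are not the data of the cited study, and the function name is illustrative.

```python
def classification_summary(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Hit rate, sensitivity, and specificity from a 2x2 classification table.

    tp/fn: children with ADHD classified correctly/incorrectly.
    tn/fp: controls classified correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("both groups must contain at least one child")
    return {
        "hit_rate": (tp + tn) / total,      # overall correct classification
        "sensitivity": tp / (tp + fn),      # correctly identified cases
        "specificity": tn / (tn + fp),      # correctly identified controls
    }


# Hypothetical counts (not taken from the reviewed studies).
summary = classification_summary(tp=40, fn=10, tn=45, fp=5)
```

Because the hit rate pools both groups, it always lies between sensitivity and specificity and depends on the relative group sizes in the sample.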

Test Observation Form (TOF)

The TOF is a comprehensive direct behavioral rating scale [28], which consists of 125 items describing the child’s behavior, affect, and test-taking style. It was employed in two studies of Table 5. (1) IRR ranged between r = .60 (for oppositional problems) and r = .77 (for ADHD problems) [51]. TRR in a sample of 130 typically developing children was acceptable (r = .53–.87, mean r = .80 [28]). (2) Children with ADHD combined type differed significantly on six TOF scales from a clinically referred group and a typically developing group. An overall correct classification rate of 74% was reached for the combined type versus a clinically referred sample without ADHD. The predominantly inattentive subtype could not be validly discriminated from the non-ADHD referred sample and healthy controls [138]. (3) The TOF DSM-oriented scale of ADHD problems was significantly correlated with parent ratings of inattention (r = .19) and hyperactivity-impulsivity (r = .33) on the AD/HD rating scale-IV. Correlations with teacher-rated inattention (r = .21) and hyperactivity-impulsivity (r = .31) on the AD/HD rating scale-IV were also significant [51]. (4) The TOF has not been employed for treatment evaluation.

Hillside Behavior Rating Scale (HBRS)

The seven items of the HBRS were assessed during test sessions in one study. The items were rated after the completion of tests of intelligence and academic achievement in ADHD preschoolers [139]. (1) Significant IRR coefficients (r = .58–.68) were reached. TRR was not reported. (2) The composite ADHD score of the HBRS (with items directly corresponding to DSM-IV) was significantly higher in preschoolers with ADHD than in comparison children. HBRS ratings provided small but significant incremental validity in the prediction of functional impairment over parent and teacher reports [139]. Sensitivity and specificity were not evaluated. (3) Findings for convergent validity between the HBRS DSM-oriented ADHD scale and the number of ADHD symptoms reported by parents on the Diagnostic Interview Schedule for Children (DISC) and teachers on the DSM-IV version of the Disruptive Behavior Disorder (DBD) checklist ranged from r = .32 to .50. Correlation coefficients were higher for parent ratings [139]. (4) HBRS test session observations were not used for treatment evaluation.

Short Summary

Four different observational tools for observations of test behavior were used in a total of 27 studies (Table 5: GATSB [n = 1], HBRS [n = 1], RAS [n = 23], TOF [n = 2]).

  • IRR: mostly acceptable (agreement 67–92%, r = .58–.68, kappa = .48–1, ICC = .97–.99). The lowest percentage agreement was reported for the RAS [140, 142], the lowest Pearson r was reported for the HBRS [139], the lowest kappa was reported for the RAS [143]. Not reported in 13 studies.

  • TRR: reported for two instruments: the RAS, ranging between r = .61 and .86 [147, 153], and the TOF, ranging between r = .53 and .87 in typically developing children [28].

  • Correct classification: ranged between 64% (RAS [125]) and 88% (GATSB [135]); analyzed for three instruments (GATSB, RAS, TOF) [122, 125, 135, 138].

  • Convergent validity: reported in four studies for three instruments (RAS, HBRS, TOF); small to moderate agreement with parent ratings (r = .19–.50), teacher ratings (r = .21–.38), and CPT scores (r = .26–.39) [6, 51, 134, 139].

  • Treatment outcome: significant pharmacological intervention effects were found using the RAS (n = 10); significant effects of a non-pharmacological intervention were found using the RAS [149].

Parent–Child Interaction Observations

Eleven studies conducted since 1990 have included behavioral observations of children or adolescents with ADHD while interacting with their parents in the laboratory (Table 6). Only observed child behavior (not parenting) is focused on here.

Table 6 Parent–child interaction observation studies of children and adolescents with ADHD (n = 11)

Disruptive Behavior Diagnostic Observation Schedule (DB-DOS)

The DB-DOS was applied in one study of Table 6. In this study, the DB-DOS was extended by ten items on ADHD symptoms, and a total of 31 items were then rated from 5-min taped interactions between the target child and the parent (parent context) or an examiner (examiner context) [158]. (1) IRR was good (ICC = .88–.95). TRR of the DB-DOS scales was moderate (ICC = .52–.80) in a group of mixed referred and typically developing children. (2) The ADHD scale of the DB-DOS reached sensitivity and specificity of 87 and 79%, respectively, as well as a 75% agreement between DB-DOS and best-estimate ADHD diagnosis. (3) Correlation coefficients between different parent and teacher ratings (Kiddie Disruptive Behavior Disorder Scale, Clinical Global Assessment Scale, CBCL, Teacher Report Form) and the ADHD scale of the DB-DOS were significant (r = .28–.42) and slightly more pronounced for parent ratings. (4) No reports on the sensitivity to change of the DB-DOS in ADHD are available.

Global Impressions of Parent–Child Interaction-Revised (GIPCI-R)

The GIPCI-R rating scale was applied in one study in preschoolers [160] and one in school-aged children with ADHD [129]. (1) The ratings of child behavior showed adequate IRR (r = .71–.84) [129], but lower IRR was achieved in the study in preschoolers (ICC = .48–.77) [160]. TRR was rather low (r = .41–.50 [129]; r = .20 [160]). (2) Predictive validity and (3) convergent validity were not evaluated for the GIPCI-R. (4) Parent training did not significantly improve GIPCI-R observed child behavior during parent–child interactions [129, 160].

Dyadic Parent–Child Interaction Coding System-Revised (DPICS-R)

The DPICS-R was applied in one study of Table 6 [120]. (1) It reached adequate IRR for child deviance and child positive behavior (ICC = .70 and .96, respectively) [120]. TRR, (2) predictive validity and (3) convergent validity were not reported. (4) A significant decrease in child deviance after combined parent and child training for preschoolers with ADHD was reported [120].

MTA Parent–Child Interaction

Wells et al. [159] investigated the effects of the multimodal treatment on four rated child behaviors during parent–child interactions (complaining, verbal abuse, compliance, likable). The same observational tool was applied in another non-pharmacological treatment study [161]. (1) The IRR for these direct ratings was reasonable (r = .62–.85) [159, 161]. TRR, (2) predictive validity, and (3) convergent validity were not reported. (4) Significant treatment-related changes in observed child behavior were reported by Babinski et al. [161], but not by Wells et al. [159].

Parent and Adolescent Interaction Coding System (PAICS)

The PAICS was applied in three studies with adolescents with ADHD [162, 163, 165]. The PAICS codes six behavior categories from transcribed discussions between adolescents and their parents. Typically, a 10-min neutral discussion about a vacation was followed by a 10-min discussion about conflicts. (1) Agreements between coders ranged between 53 and 85% [162, 163]. Two-week TRR was reported to be low [165] (although not specified numerically). (2) Between-group differences in negative communicative behavior between adolescents with and without ADHD were to a great extent accounted for by comorbid oppositional defiant behavior [162, 163]. The predictive validity and (3) convergent validity were not examined. (4) Changes in observed adolescent communicative behavior after different non-pharmacological interventions were not uniformly positive [165, 166].

Conflict Rating Scale (CRS)

The CRS [167] was originally used to rate marital conflict interactions. (1) It was applied with adequate IRR (ICC = .64–.82) in two studies of Table 6 to rate 15 dimensions of positive and negative communication during parent–teen conflict and neutral discussions in samples of adolescents with ADHD and ODD [164, 166]. TRR was not examined for the CRS. (2) The ADHD/ODD group showed significantly more negative behavior and less positive behavior than comparison teens during the conflict discussion [164] (sensitivity/specificity were not assessed). (3) Convergent validity was not assessed. (4) No uniformly positive treatment effects were found on the CRS-rated teen behavior after completion of communication training [166].

Short Summary

Six different observational tools for observations of parent–child interactions were used in a total of 11 studies (Table 6: CRS [n = 2], DB-DOS [n = 1], DPICS-R [n = 1], GIPCI-R [n = 2], MTA observation [n = 2], PAICS [n = 3]).

  • IRR: mostly acceptable (ICC = .48–.97, r = .71–.84, kappa = .68, agreement = 53–81%). The lowest percentage agreement and kappa were reported for the PAICS [162, 163], the lowest ICC was reported for the GIPCI-R [160]. All studies reported IRR.

  • TRR: reported for two instruments (DB-DOS [158], GIPCI-R [129, 160]), ranging between r = .20 and .50 for the GIPCI-R and between ICC = .52 and .80 for the DB-DOS.

  • Correct classification: DB-DOS had 75% agreement with ADHD diagnosis [158].

  • Convergent validity: reported for one instrument (DB-DOS); small to moderate agreements with parent ratings (r = .30–.42) and teacher ratings (r = .28–.32) [158].

  • Treatment outcome: significant effects of non-pharmacological interventions were found using the DPICS-R [120] and the MTA observational tool [161] (but not in [159]); no significant treatment effects were found using the GIPCI-R [129, 160], the CRS [166], or the PAICS [165].

Peer–Child Interaction Observations

Five studies applied a specific observational tool for observing peer–child interactions (Table 7).

Table 7 Peer–child interaction observation studies of children with ADHD (n = 5)

Test of Playfulness (ToP)

The ToP [173] is an observer-rated scale to assess the construct of playfulness, consisting of 29 items. (1) In non-ADHD samples, evidence of acceptable IRR and TRR (ICC = .67) for the ToP was found [174, 175]. (2) Children with ADHD scored significantly lower on the overall playfulness measure in the laboratory [169] as well as in the naturalistic setting [168]. Sensitivity/specificity was not examined. (3) The convergent validity was not reported for the ToP. (4) A significantly improved ToP overall score was reported in children with ADHD after the completion of an intense play-based intervention compared to the pre-intervention baseline [172].

Discussion

This review sought to comprehensively cover the current state of systematic direct observational tools that are used in the study of ADHD. In total, 135 research findings from 29 different systematic observational tools, published between 1990 and 2016, were summarized in tables. We systematically delineated the reliability characteristics and the evidence of clinical validity for 16 observational instruments from the naturalistic setting, and for 13 instruments from the laboratory setting. A summary thereof is provided in Table 1.

Naturalistic Versus Laboratory Settings

We found considerably more research on systematic observational tools from naturalistic contexts (n = 83) than from standardized laboratory settings (n = 52). This imbalance is likely attributable to the advantageous ecological validity of classroom observations.

In total, 55 out of 83 (66%) naturalistic observation studies and 30 out of 52 (58%) laboratory observation studies reported IRR. Enhanced objectivity and comparability in the laboratory minimize the problem of low inter-rater agreement, which was more pronounced in naturalistic observations.

TRR has been examined more frequently for laboratory tools (7 out of 13) than for naturalistic observational tools (4 out of 16) and coefficients were in a slightly higher range in analogue laboratory settings (e.g., test session: r = .61–.86) than in naturalistic settings (e.g., classroom: r = .25–.77). Playground, lunchroom, and parent–child interactions were the least stable situations for observation.

Classification rates seemed to be slightly higher for classroom observational tools than for laboratory observational tools. In general, group-level differences were more frequently analyzed than classification rates.

Significant treatment effects were found with both naturalistic and laboratory observational tools.

Which Tools to Use

Classroom

Based on the reviewed reliability and validity information, classroom observations should be preferred over other types of naturalistic observations. The BOSS, the COC, the COCADD, the DOF, and the SKAMP provide tools that are based on a number of independent studies and some psychometric validation. Nevertheless, each system has its advantages and disadvantages. Generalizability and dependability analyses have provided important information on the reliability of the BOSS and the DOF. Moreover, the DOF is the only tool that provides norms. The SKAMP has revealed good scores for TRR (although measured on the same day) [58], but low IRR. Even though an age-related decline in observable ADHD behavior may be assumed [91], the COCADD provided evidence for lasting observable behavioral differences and significant improvement with medication for adolescent patients with ADHD [89, 90].

Test Session

In the laboratory, more structured situations for observations, such as test sessions, were found to discriminate better between ADHD and controls than independent play observations. Moreover, non-pharmacological interventions did not consistently cause a change in observed play behavior. Therefore, test session observations should be favored for studying ADHD behavior. The RAS and the TOF provide adequate tools for this purpose. The TOF has the advantage of providing norms. The RAS, however, is based on more evidence than the TOF. The RAS can be applied to observe behavior during academic seatwork and during CPTs. RAS variables were suggested to provide even better discrimination between ADHD and controls than actual task performance [6, 133, 176]. Nonetheless, problems of the RAS lie in the low IRR coefficients that have been occasionally reported (e.g., [140, 142, 143]) and in the fact that group differences were not uniformly found in the same RAS variables. Furthermore, adolescents with ADHD could not be distinguished from clinically referred participants [152], which calls into question the specificity of the RAS behaviors for ADHD. However, effect sizes for stimulant-induced change were larger for RAS measures than for parent and teacher ratings [148].

Laboratory Interactions

Parent–child interactions were found to be rather unstable and results were mixed regarding the sensitivity to change of parent–child interaction observational tools. The use of these tools as treatment outcome measures is complicated by the possible difficulty of disentangling effects on parenting from effects on child behavior. This interdependency may also be responsible for the low stability. Furthermore, it must be kept in mind that parent-adolescent interactions seem to provide a measure of ODD symptoms rather than of ADHD [162, 163]. Nonetheless, for adolescents, the CRS and the PAICS are likely to be useful, while for younger children, all reviewed tools seem to have comparable utility. Interactions with a non-familiar adult could provide a more controllable alternative for highly unstable parent–child interaction observations, as the DB-DOS examiner context reached better reliability coefficients (IRR, TRR, Cronbach’s alpha) than the DB-DOS parent context [158] (see also [177, 178]). The evidence base for peer–child observations with the ToP in ADHD is not sufficiently established.

General Methodological Issues and Suggestions for Future Research

The present review revealed several issues to be resolved in future research. First, all studies need to formally assess and report IRR and to provide adequate training for observers. In particular, more consistent reporting of IRR should be aimed at for the SKAMP, in view of the frequent use of this observational scale in medication trials. For time-sampling procedures, kappa coefficients should be preferred over percentage agreement for the analysis of IRR. In particular, the RAS lacks reports of kappa coefficients.
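The preference for kappa over percentage agreement can be illustrated with a short Python sketch. It compares both indices for two observers who code a rarely occurring behavior across intervals; the interval codes are hypothetical, and the function names are illustrative. Cohen's kappa is computed in its standard form, (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Proportion of intervals in which both observers gave the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty code lists of equal length")
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two observers (Cohen's kappa)."""
    n = len(codes_a)
    p_o = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of both observers' marginal proportions per code.
    p_e = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 20 intervals, each observer codes the behavior twice,
# but they agree on only one of those occurrences.
observer_a = [1, 1, 0] + [0] * 17
observer_b = [0, 1, 1] + [0] * 17
agreement = percent_agreement(observer_a, observer_b)
kappa = cohens_kappa(observer_a, observer_b)
```

In this hypothetical case, percentage agreement is .90, whereas kappa is only about .44, because most of the agreement stems from intervals in which neither observer coded the behavior. This is why percentage agreement can overstate IRR for low-frequency behaviors in time-sampling procedures.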

Second, the TRR should be assessed more consistently. Crucially, stability of behavior should be investigated within ADHD groups separately, because evidence strongly suggests increased behavioral variability in ADHD [179, 180]. In addition, naturalistic settings are particularly vulnerable to the impact of uncontrollable contextual factors. Influences such as the time of day, the academic subject, the teaching method, or even the time in the school year create potential biases to the reliability of observed behavior [39, 46, 88].

There is a particular lack of reports on the convergent validity for play observations and parent–child interaction observations. Otherwise, agreements with parent and teacher reports of ADHD symptoms are typically small to moderate and classification rates hardly exceeded 80%. Therefore, we conclude that none of the reviewed observational instruments can be applied as a stand-alone diagnostic procedure. Analyses on the incremental validity of observational tools revealed negligible contributions to the prediction of ADHD or functional impairment over and above parent and teacher reports [51, 139]. However, a problem of circularity complicates the evaluation of the predictive validity of observations because the diagnostic criteria of ADHD are primarily based on parent and teacher interviews [138], and not on an objective, absolute quantification of observable behavior. Therefore, it may be rather challenging to obtain objective measures that reach comparable clinical validity in ADHD diagnosis to parent and teacher ratings. Moderate degrees of agreement suggest that behavioral observations target unique aspects of problematic behavior that are not covered by parent and teacher reports alone. Observational data might therefore aid the interpretation of inconsistencies between different sources of ADHD ratings. However, systematic behavioral observations cannot act as a substitute for parent and teacher ratings.

Although the evidence indicates that observational methods are not appropriate for diagnosing ADHD when applied as a stand-alone approach, this does not preclude their potential value for assessing treatment outcome [11]. Significant improvements in observed behavior after treatment have been reported for most tools (see Table 1). It is debatable whether this is sufficient to assume treatment sensitivity for these methods (see [181]). Clearly, study designs that lack observations in an untreated control group (e.g., [84, 172]) or observation of pre-treatment behavior (e.g., [83]) should be avoided. Dependability studies suggest that pre-post designs with one observation each may not be sufficiently reliable to monitor treatment effects on classroom behavior [94, 95]. Normalization rates and the clinical significance of change should be more consistently reported and norms should be established. Furthermore, common instruments used for the evaluation of behavioral treatments (e.g., BOSS) should be validated with regard to pharmacological effects and vice versa (e.g., SKAMP). Observational methods have earned the reputation of being the gold standard for treatment evaluation [14]. Based on the present review, however, this presumption might have to be reconsidered. Issues of low temporal stability, considerable behavioral variability, unknown TRR of most instruments, and the lack of normative data impede the thorough evaluation of the sensitivity to change of these methods.

Observee reactivity poses a further problem for behavioral observations [182,183,184]. This phenomenon occurs if behavior is altered due to the awareness of being observed [182, 184]. Efforts should therefore be made to conduct observations as unobtrusively as possible [183] and studies should consistently specify how the presence of observers or video cameras was explained to the participants. First-grade students were not found to show reactivity to observers in the classroom [185]. Similar investigations would be necessary to ascertain the impact of observee reactivity in ADHD samples and adolescents.

In general, adolescent ADHD patients seem to be underrepresented in studies using observational methods (13 out of 135 studies [10%]). Validation of instruments with a specific focus on this age group would be highly desirable.

The confounding effect of comorbidity was not sufficiently addressed in many of the reviewed studies. Comorbid disruptive behavior disorders augmented the levels of observed dependent measures in some cases (e.g., [36, 111, 162]), but not all (e.g., [107, 158]). Not only the influence of comorbid disorders, but also the differentiation between different psychopathological groups needs to be more systematically analyzed in the future. Behaviors such as classroom aggression or noncompliance clearly overlap with symptoms of other externalizing behavior disorders such as ODD or CD. Moreover, students with learning disabilities were also found to exhibit elevated levels of off-task behavior and disruptive behavior in the classroom (for review see [186]), and children with ASD displayed high amounts of out-of-seat behavior [187]. Hence, the specificity of such behaviors for ADHD is questionable and needs to be taken into account when interpreting observational data.

Based on these considerations, we recommend considering the following critical factors for planning and conducting systematic behavioral observations:

  1. Is satisfactory IRR established through observer training?

  2. Is IRR formally assessed and reported? Is the kappa coefficient indicated if a standardized sampling procedure is applied?

  3. Are observers blind with regard to the subject status and the treatment condition?

  4. Is observee reactivity controlled for as effectively as possible?

  5. Is a sufficient amount and duration of observation episodes assured? Are dependability studies taken into account (see [94, 95])?

  6. Are situational influences controlled for (e.g., time of day, school subject, teacher)?

  7. Is an adequate control group included that is observed with the same intensity and frequency as the experimental group?

  8. Is the clinical significance of behavioral change (i.e., normalization) evaluated?

  9. Are other measures of ADHD symptoms included? To what extent do these reflect the observational findings?

Summary

This review evaluated the clinical utility of observational methods in research on children and adolescents with ADHD. Twenty-nine instruments for observing classroom behavior (11 tools), naturalistic social interactions (5 tools), independent play (2 tools), test session behavior (4 tools), parent–child interactions (6 tools), or peer interactions (1 tool) were reviewed. Tools for classroom and test session observations showed the most promising psychometric properties. The RAS and the TOF may be recommended for test session observations. The BOSS, the COC, the COCADD, the DOF, and the SKAMP seem to be reasonable choices for the study of ADHD classroom behavior. However, the psychometric properties of all of these instruments need more systematic validation.

Future research should intensify the investigation of the discriminative validity of observational measures with regard to different comorbid groups, other psychiatric disorders (e.g., learning disorder, ASD), and clinically referred groups. The incremental validity of observations over other diagnostic methods should be assessed more consistently and efforts should be made to obtain normative data for observational instruments. Furthermore, many observational instruments lack a report of TRR and/or dependability. Treatment-related changes in observed behavior should be cross-validated with other instruments and the concurrent validity between different observational tools should be established.