Introduction

The diagnostic assessment of Autism Spectrum Disorder (ASD) in children is a complex process, in which information is gathered from parents (or caregivers) about the child’s developmental history and current level of functioning, together with first-hand observations by an experienced clinician [1,2,3,4]. Standardized semi-structured observation instruments and parental interviews are now widely used in this information-gathering process [1]. A narrow use of instruments, such as administering only the algorithm items or focusing solely on the algorithm’s outcome, should never decide diagnostic classification; instead, classification should rely on the integration of different sources of information, including a parental interview and a child observation, and on information from different contexts [2, 5, 6]. Nevertheless, in a research context, participants’ clinical diagnoses are sometimes validated using a semi-structured observation instrument and/or a parental interview, and participants who do not meet the threshold are sometimes even excluded from the research sample, which could lead to a biased understanding. If we consistently exclude individuals from research based on one particular sub-criterion (because the instruments do not adequately measure it), we may not adequately represent individuals with difficulties in that particular area. It is, therefore, important to study to what extent behaviors described by the DSM-5 criteria are represented in diagnostic assessment instruments for ASD, as well as the procedures by which a classification according to these DSM-5 criteria could be implemented. Gaining insight into the content validity of the algorithms can help clinicians understand why an individual meets the threshold on a specific instrument (or fails to do so), so that they can seek converging (or diverging) information from other sources. In this way, our study could be important both in supporting clinicians’ decision-making processes and in facilitating parity of research samples recruited according to the DSM-5 criteria.

Autism spectrum disorder in DSM-5

The latest edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) [7] includes significant changes to the diagnostic criteria for ASD. While DSM-IV-TR [8] delineated five different sub-classifications, DSM-5 abandoned those sub-classifications in favor of one single classification, Autism Spectrum Disorder (ASD). Additional changes were related to the diagnostic criteria. Instead of a triad of impairments, DSM-5 characterized ASD by deficits in two core domains: (1) impairments in social interaction and social communication, and (2) repetitive and restricted patterns of activity, behaviors and interests (RRBIs). More specifically, to meet DSM-5 criteria for ASD, individuals are required to meet all three sub-criteria within the social interaction and social communication domain, and two out of four of the sub-criteria within the RRBI domain (for more details, see Appendix A). The latter rule gave greater significance to RRBIs; in DSM-IV-TR, only one of four RRBI sub-criteria had to be met. In addition, the number of possible combinations of sub-criteria that would qualify for an ASD diagnosis was limited from 2027 for a DSM-IV-TR diagnosis to 11 possible combinations for a DSM-5 diagnosis [9]. Furthermore, sensory problems were added as a new symptom within the RRBI domain, and language problems were removed from the core ASD symptoms and considered instead as co-occurring difficulties (like intellectual disability) that can be indicated with a specifier to describe an individual’s profile. Finally, DSM-5 stipulates levels of severity for both domains of impairment based on the required level of support.
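
The two counts cited above can be reproduced with a short enumeration. The sketch below is illustrative only. The DSM-5 rule is taken from the text (all three A sub-criteria plus at least two of four B sub-criteria). The DSM-IV-TR rule is not spelled out in this paper and is assumed here to be the Autistic Disorder rule: at least six of twelve items in total, with at least two of four social items, at least one of four communication items and at least one of four RRBI items. The function names are hypothetical.

```python
from itertools import product
from math import comb


def dsm5_combinations() -> int:
    # All three A sub-criteria are mandatory (one way to meet them),
    # so only the choice of 2, 3 or 4 out of the four B sub-criteria varies.
    return sum(comb(4, k) for k in range(2, 5))


def dsm_iv_tr_combinations() -> int:
    # Assumed rule (not stated in the text): >= 6 of 12 items in total,
    # with >= 2 social, >= 1 communication and >= 1 RRBI item,
    # each domain containing four items.
    total = 0
    for social, communication, rrbi in product(range(5), repeat=3):
        enough_per_domain = social >= 2 and communication >= 1 and rrbi >= 1
        if enough_per_domain and social + communication + rrbi >= 6:
            total += comb(4, social) * comb(4, communication) * comb(4, rrbi)
    return total


print(dsm5_combinations())       # 11
print(dsm_iv_tr_combinations())  # 2027
```

Under these assumptions the enumeration returns the same figures as reported in [9].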

Such a change in diagnostic criteria could significantly alter the characterization of autism with consequences for the number of individuals being diagnosed. Although DSM-5 explicitly states that individuals previously diagnosed with Autistic Disorder or Asperger’s Disorder should qualify for a DSM-5 diagnosis of ASD, meta-analyses and literature studies suggest that a significant proportion of individuals who met DSM-IV-TR criteria will fail to meet DSM-5 criteria for ASD, especially those with a diagnosis of PDD-NOS or Asperger’s Disorder [10,11,12].

Aims of the current study

DSM-5 was published in 2013 [7]. Recently, the 11th revision of the International Classification of Diseases (ICD-11) [13] has also been published, paralleling DSM-5. Given that some authors have suggested that application of the new DSM-5 criteria can result in a shift and a decrease of ASD diagnoses (for a review, see [11]), the aim of the study was to document the effect of DSM-5 changes on existing diagnostic instruments that have been designed to guide diagnostic judgements. Specifically, the purpose of the paper is to systematically identify how these instruments operationally define diagnostic criteria and sub-criteria, and whether the instruments are consistent in the way that behavior is operationalized to match DSM-5 criteria. This paper does not aim to evaluate the correctness of diagnostic classifications (as empirical studies of psychometric properties do); rather, it aims to characterize ASD behaviors and the consistency with which they are operationalized. This operationalization, or content validity, also contributes to the clinical utility of an instrument, as it is crucial that clinicians and researchers gain insight into which sub-criteria are covered by the instruments and how the algorithm was developed. Such insight can help clinicians to understand and analyze why an individual meets the threshold on a specific instrument (or fails to do so) and to seek evidence related to the criteria that are not covered.

Three diagnostic instruments have developed new algorithms, specifically designed to measure DSM-5 criteria characteristics, but it is not yet clear whether these three instruments cover DSM-5 symptoms to the same extent and whether different procedures used by the three algorithms can lead to different diagnostic outcomes. As part of the development of some of the instruments, the specific instrument items were mapped onto DSM-5 criteria (for ADOS-2, see [14, 15]; for DISCO-11, see [16]), which is also presented in the Results section. The revised DSM-5 adapted algorithms of these instruments (or preliminary versions of them) demonstrated good psychometric properties [17,18,19,20,21,22] (for a systematic review of psychometric characteristics of instruments available for preschoolers, see [23]). Previously, Huerta and colleagues [14] have studied the content validity of the ADOS-2, and concluded that the instrument did not cover all sub-criteria for ASD, but the DSM-5 algorithms of different instruments were not yet compared directly in terms of content validity. Therefore, the first goal of the study was to establish the content validity of these three DSM-5-adapted algorithms. The second was to evaluate the clarity of the DSM-5 criteria themselves and identify possible pitfalls when operationalizing the DSM-5 (sub-)criteria into concrete measurable behaviors. In this way, we hope to guide future improvements in diagnostic instruments and classification systems.

Method

Procedure

Selection of instruments

Autism-specific diagnostic interviews and observation schedules for children and adolescents with a wide age range were selected from the guidelines for diagnosis of autism developed by the National Institute for Health and Care Excellence [5], excluding screening instruments and questionnaires. Only instruments with newly developed DSM-5-based scoring principles were included, yielding the following three instruments: the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2; [21, 22]), the Developmental, Dimensional and Diagnostic Interview (3di; [24]) and the Diagnostic Interview for Social and Communication Disorders, 11th edition (DISCO-11; [25]). The Autism Diagnostic Interview-Revised (ADI-R) was not included, because, to our knowledge, no DSM-5 adapted algorithm for children and adolescents has been published, except for the adapted algorithm for children aged between 12 and 47 months [26]. Our decision to only include instruments with a wide age range was based on the desire to be as inclusive as possible in representing ASD characteristics across a broad developmental span, while at the same time enabling comparisons across instruments.

Item mapping

Items included in the DSM-5 algorithms of the instruments (for ADOS-2, see [21, 22]; for 3di, see [27]; for DISCO-11, see [17]) were compared to the DSM-5 description of (sub-)criteria and exemplars, taking into account full item descriptions and coding options, and independent of the classification according to the instrument. Two raters (KE and JM), experienced in the diagnostic assessment of children with ASD, independently categorized all items. A multidisciplinary expert panel, consisting of KE, JM, IN, WM and AD, a group of professionals that is highly experienced in the diagnostic assessment of ASD both in the context of research and in clinical practice, discussed items when: (1) there was disagreement between the two raters, or (2) the categorization by the two raters was different from the categorization according to the instrument. Final decisions were based on the panel discussion. All expert panel members were trained in the assessment and coding of at least two of the three instruments.

Evaluation of algorithm classifications

Algorithm classification procedures of the evaluated instruments were compared to DSM-5 (sub-)criteria and diagnostic decision-making rules.

Identifying difficulties when operationalizing DSM-5 (sub-)criteria into behaviors

The discussion of the difficulties relating to the clarity of the DSM-5 criteria (as per aim 2, for results, see “Discussion”) was based on expert panel discussion. Items were discussed by the expert panel when there was disagreement between two raters, or when the categorization by two raters was different from the categorization according to the instrument. In addition, individual expert panel members also noted items for which they were unsure of the classification, and those items also contributed to the discussion.

Instruments

Autism diagnostic observation schedule (ADOS-2)

The ADOS-2 [21, 22] is a semi-structured, standardized observational assessment in which toys, activities and/or conversations are used to elicit communication, social interaction, play, and repetitive and stereotyped behaviors relevant to the diagnosis of ASD. Administration consists of direct observation by a trained examiner in a one-on-one situation (except for young children in the Toddler Module, and Modules 1 and 2, when a familiar adult is present as well). The ADOS-2 can be used to assess individuals of all ages and levels of functioning and offers five different modules and eight different algorithms, from which one module and algorithm is selected based on the individual’s expressive language level and chronological age. Observations of the individual’s behavior are coded on 28 to 41 items (depending on the module chosen), usually on a scale from 0 to 2, with higher scores indicating greater symptom severity. The administration of the ADOS-2 takes approximately 40 to 60 min. Revised DSM-5-adapted algorithms were published in the ADOS-2 manual (for all modules apart from Module 4), with a sensitivity between 0.60 and 0.95, and specificity between 0.75 and 1 (depending on the administered module; [21, 22]). For Module 4 (for fluently verbal adolescents and adults), the DSM-5 algorithm is not yet integrated in the instrument’s manual, but a research version has been published and demonstrated overall sensitivity between 84.6% and 90.5%, and specificity between 72.1% and 82.2% [19, 20].

Developmental, dimensional and diagnostic interview (3di)

The 3di [24] is a computerized parental (or caregiver) interview that is a hybrid of a fully structured and a semi-structured interview. A trained examiner collects information on the individual’s developmental history and on a broad range of skills and behaviors that are relevant not only for an ASD diagnosis, but also for co-occurring problems. Prior to the interview, the examiner inputs identifying information, which tailors the wording of questions. Scoring broader, complex questions is not required: such questions were broken down into more specific items to increase reliability [24]. The 3di was primarily designed to assess individuals aged 2–21 years with normal-range intellectual abilities, but it may also be used among those with intellectual disability, and an adult version of the interview was recently published as well [28]. The 3di comprises more than 700 questions that are grouped in 23 different sections. The number of questions included in the interview has increased over the years, with different research groups adding new questions on specific DSM-5-related topics, thereby generating different parallel versions of the full interview instrument. Interviewers almost never administer every question: the 3di is constructed of different modules, each including a subset of questions. Depending upon the purpose and/or suspicion of co-occurring problems, the full autism module might be complemented with one of the modules on co-occurring problems. The majority of questions concerning atypical behaviors are coded on a 3-point severity scale: 0 (described behavior is not present), 1 (minimal evidence of described behavior), and 2 (definite or persistent evidence of described behavior). The 3di requires the interviewer to rate whether a behavior has “ever” occurred or occurs “now”. Administration time strongly depends on the selected module, ranging from 45 min (short version; [29]) to 2 h. Psychometric properties for the DSM-5 version of 3di [27] have not yet been investigated. Classifications based on a preliminary version of this algorithm were compared to ADOS-2-classifications, showing a sensitivity of 0.84 and a specificity of 0.54 [18].

Diagnostic interview for social and communication disorders (DISCO-11)

The DISCO-11 [25] is a semi-structured parental (or caregiver) interview, in which a trained examiner collects information about an individual’s developmental history and a broad range of skills and behaviors relevant for an ASD diagnosis; information on other domains is also collected. Individuals of all ages and levels of functioning can be assessed using the DISCO-11. The DISCO-11 comprises more than 300 items that are grouped in eight different sections. The majority of items concerning atypical behaviors are coded on a 3-point severity scale: 0 (marked problem), 1 (minor problem), and 2 (no problem). For most of these items the DISCO-11 distinguishes both ‘ever’ and ‘current’ ratings of the individual’s behavior. Other items measure the current level of functioning: the higher the level of achievement, the higher the score, with codes ranging between 0 and 12. Another type of item concerns developmental milestones: for some, the actual age of achievement (in months) is coded; for others, whether there was a delay in achieving the milestone is coded. The last type of item rates the quality of behavior based on qualitative descriptions for each category (maximum of 10). Administering the complete interview takes approximately 2–3 h, but it is also possible to only complete the items relevant for the diagnostic algorithms, resulting in a shortened administration time (about 45–60 min; [16]). The DSM-5 algorithm has been shown to have a good sensitivity and specificity, ranging from 0.85 to 1.00 and 0.74 to 0.89, respectively, based on different samples [17].

Results

Given that none of the instruments’ algorithms explicitly included items related to criterion D (‘Significant impairment in functioning’) or E (‘Not better explained by intellectual disability’), our analyses focus on criterion A (‘Deficits in social interaction and social communication’), B (‘RRBIs’), and C (‘Early onset’).

Item mapping

All items were mapped onto each of the DSM-5 sub-criteria for ASD by the two coders (see Table 1 and Appendix B for the more detailed item mappings for each instrument). Inter-rater agreement was high, with agreement between expert raters for 68 out of 70 ADOS-2 items (97%), 62 out of 63 items for 3di (98%), and 80 of 85 DISCO-11 items (94%). An additional 23 items were discussed in the expert panel, as they were categorized differently by the raters compared to the instrument: for ADOS-2, 1 out of 70 items (1%; but note that the instrument only categorizes items based on the two main criteria and not based on sub-criteria); for 3di, 10 out of 63 items (16%); and for DISCO-11, 12 out of 85 items (14%).
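
As a quick check on the arithmetic, the percentages above can be recomputed from the reported counts. The sketch below assumes that the reported figures are raw percent agreement rounded to whole percentages; the text does not state whether a chance-corrected coefficient was also computed, and the variable and function names are hypothetical.

```python
# Counts taken from the paragraph above: (items agreed on, items in total).
rater_agreement = {"ADOS-2": (68, 70), "3di": (62, 63), "DISCO-11": (80, 85)}

# Items on which the raters' categorization differed from the instrument's.
differs_from_instrument = {"ADOS-2": (1, 70), "3di": (10, 63), "DISCO-11": (12, 85)}


def percent(part: int, whole: int) -> int:
    """Raw percentage, rounded to the nearest whole number."""
    return round(100 * part / whole)


for name, (agreed, total) in rater_agreement.items():
    print(f"{name}: {percent(agreed, total)}% inter-rater agreement")

# Total number of additional items discussed by the expert panel.
panel_items = sum(part for part, _ in differs_from_instrument.values())
print(panel_items)
```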

Table 1 Summary of ADOS-2, 3di and DISCO-11 item mappings on DSM-5 (sub-)criteria for ASD

Autism diagnostic observation schedule (ADOS-2)

The ADOS-2 groups items into ‘Social Affect’ and ‘Restrictive and Repetitive behaviors’, without further specifications regarding DSM-5 sub-criteria. This division into ‘Social Affect’ and ‘Restrictive and Repetitive behaviors’ parallels our item mapping on criteria A and B, although one item was categorized differently: whereas the ADOS-2 manual categorized ‘Reporting of events’ (in which the ability is evaluated to describe a non-routine event in an understandable manner, an item that is only included in Modules 3 and 4) under ‘Social Affect’, our item mapping did not organize this item in any of the sub-criteria, as it mainly reflects the level of expressive language skills, an aspect that is no longer part of the DSM-5 criteria.

Our analysis shows that the ADOS-2 DSM-5 algorithm mainly focuses on criterion A symptoms, and more specifically on A1 (‘Deficits in socio-emotional reciprocity’) and A2 (‘Deficits in nonverbal communication’). Only 3–5 (out of 14–15) items cover the criterion B symptoms, with an emphasis on ‘Stereotyped and repetitive behaviors’ (B1). There are at most two ADOS-2 items measuring symptoms in the area of ‘Deficits in relationships’ (A3) and no items on ‘Insistence on sameness and routines’ (B2). Items both for A3 (‘Deficits in relationships’, for example, item ‘Insight into social relationships’) and for B2 (for example, item ‘Compulsions or rituals’) are available in the instrument, but these items are not included in most modules’ algorithms. No indicators for early onset (criterion C) are available: the ADOS-2 focuses on current behaviors and does not include the presence of criterion A (‘Deficits in social communication and interaction’) or B (‘Restricted, repetitive behaviors, interests or activities’) symptoms in the past.

Developmental, dimensional and diagnostic interview (3di)

The developers of the 3di used clinical agreement to classify DSM-5 algorithm items into the specific sub-criteria, but factor analysis was not used for confirming this selection. We categorized 10 out of 63 items differently compared to the instrument. More details can be found in Appendix B.

All sub-criteria within A (‘Deficits in social communication and interaction’) and B (‘Restricted, repetitive behaviors, interests or activities’) are measured by at least five questions of the 3di. Multiple questions are used to assess the same exemplars. For example, three questions are included about sharing objects or food (A1, ‘Deficits in social-emotional reciprocity’), five questions about stereotyped and repetitive speech (B1, ‘Stereotyped and repetitive behaviors’), and seven questions about hypersensitivity to sounds (B4, ‘Hyper- or hyporeactivity’). On the other hand, some exemplars are not covered, such as ‘Failure to initiate or respond to social interactions’ (under A1, ‘Deficits in social-emotional reciprocity’), ‘Difficulties with transitions’ (under B2, ‘Insistence on sameness and routines’), or ‘Apparent indifference to pain/temperature’ (under B4, ‘Hyper- or hyporeactivity’). Even though the instrument comprises an extensive developmental history, no items on criterion C (‘Early onset’) are included in the algorithm. For all items, interviewers should take into account both current and past behaviors when attributing a score, matching the specification in DSM-5 that criteria can be met currently or by history, as long as the total presentation is currently impairing.

Diagnostic interview for social and communication disorders (DISCO-11)

The developers of the algorithm mapped items from the DISCO to the DSM-5 sub-criteria based on clinical agreement, and the item selection has not yet been validated using factor analyses [17]. Twelve out of 85 items were categorized differently in the current analyses compared to the original organization of items by the authors (for more details, see Appendix B).

All sub-criteria under criteria A and B are covered by seven or more DISCO items each. Some items are not applicable for younger children (< 4 years or < 6/7 years); these items are mainly related to ‘Deficits in relationships’ (A3). However, for younger children, six items remain applicable to measure sub-criterion A3. Within B1 (‘Stereotyped and repetitive behaviors’) and B2 (‘Insistence on sameness and routines’), five and three items, respectively, cannot be coded for minimally verbal individuals, but all other items remain applicable. The different items belonging to a specific sub-criterion cover the full range of exemplars. However, more items are available in the interview that could be used to extend and maybe even improve the algorithms, in particular for younger and minimally verbal individuals. The algorithm also includes items on early onset (criterion C), and for most criterion A and B symptoms separate scoring of current and past behaviors is required.

Analysis of the algorithms

The development (i.e., the item-selection procedure, the procedure used to set cut-off scores), decision-making rules and classification by the algorithm of the three instruments were compared to DSM-5 decision-making rules for ASD (see Tables 2 and 3). To review the instruments’ algorithms, the original authors' categorization of items under the DSM-5 sub-criteria was used. Therefore, the number of items for each sub-criterion shown in Table 2 might differ from Table 1 (mapping by our expert panel).

Table 2 ADOS-2, 3di and DISCO-11 algorithm computation and classification compared to DSM-5 criteria and sub-criteria for ASD
Table 3 Summary of the comparison of the DSM-5 algorithms of ADOS-2, 3di and DISCO-11

Autism diagnostic observation schedule (ADOS-2)

The DSM-5 algorithm of the ADOS-2 has been constructed by subdividing the standardization sample into five different groups based on age and verbal level corresponding with the five new ADOS-2 algorithms. Items were included based on their ability to distinguish individuals with autism from those without autism and comparability of concepts between modules [21, 22]. Algorithm items were subdivided in two different domains, ‘Social Affect’ (SA) and ‘Restricted and Repetitive Behaviors’ (RRB), based on exploratory and confirmatory factor analyses [30, 31].

To compute an ADOS-2 classification, all (recoded) A (‘Deficits in social communication and interaction’) and B (‘Restricted, repetitive behavior, interests and activities’) algorithm items are added and compared to one cut-off value. Such a classification procedure is not consistent with DSM-5 criteria and decision-making rules, as the ADOS-2 algorithm has no separate cut-off for criterion A (or SA) and criterion B (or RRB). An ADOS-2 classification of ASD can hence be provided based on criterion A (‘Deficits in social communication and interaction’) symptoms only. Furthermore, the skewed distribution of items over the different (sub-)criteria (see Table 2) could influence the final classification; for instance individuals with more severe problems in social-emotional reciprocity and nonverbal communication (and no RRBIs) are more likely to reach the threshold than individuals with less pronounced socio-communicative problems and many RRBIs.
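
The consequence of a single overall cut-off can be illustrated with a minimal sketch. The item scores and cut-off values below are invented for illustration and are not the ADOS-2 cut-offs; the two-domain rule is a hypothetical contrast, simplified to domain totals rather than the sub-criterion counts that DSM-5 prescribes. The function names are likewise hypothetical.

```python
def single_cutoff_classification(sa_scores, rrb_scores, cutoff):
    """Add all algorithm items and compare the overall total with one cut-off."""
    return sum(sa_scores) + sum(rrb_scores) >= cutoff


def two_domain_classification(sa_scores, rrb_scores, sa_cutoff, rrb_cutoff):
    """Hypothetical contrast: each domain must reach its own cut-off."""
    return sum(sa_scores) >= sa_cutoff and sum(rrb_scores) >= rrb_cutoff


# Invented profile: ten Social Affect items scored 1, four RRB items scored 0.
social_affect = [1] * 10
rrb = [0] * 4

# Invented cut-offs, chosen only to show the mechanism.
overall = single_cutoff_classification(social_affect, rrb, cutoff=8)
per_domain = two_domain_classification(social_affect, rrb, sa_cutoff=6, rrb_cutoff=2)

print(overall)     # True: the total is reached on Social Affect items alone
print(per_domain)  # False: no RRB symptoms were observed
```

With these invented numbers, the summed rule yields a classification without any RRB score, which is the situation described above.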

ADOS-2 provides different cut-off scores for each module, and thus for different age groups and levels of ability. The instrument also implements some indices of severity. First, ADOS-2 distinguishes between overall cut-off scores for the classifications ‘autism spectrum’ and for ‘autism’, the latter referring to a more stringent cut-off. Second, overall raw total scores can be converted into a comparison score to estimate ASD symptom severity on a 10-point scale [21, 22]. Severity scores for domain totals (SA and RRB) are available in academic publications [32, 33], but are not included in the instrument’s manual and might, therefore, be unknown to clinicians. Moreover, it is not yet clear how these specific severity scores for SA and RRB relate to the three severity levels for criteria A and B as described in DSM-5 [33, 34].

Developmental, dimensional and diagnostic interview (3di)

Different sets of items have been put forward to be included in the DSM-5 algorithms of the 3di [18, 27], but only one of those DSM-5 algorithms has been used in a peer-reviewed publication [27]. The full description of the algorithm was not included in the publication and is not integrated in the clinical software yet. Therefore, given the lack of transparency on how the algorithms were constructed and the underlying decision-making rules, it was necessary to obtain the algorithm from the authors of the instrument directly to carry out any analysis of it.

For the DSM-5 algorithm [27], 63 items were selected from the full set of items included in the DSM-IV-TR algorithm [24], complemented by items from the Children’s Communication Checklist (CCC, items that are also included in the full version of the 3di; [35]) via a two-stage process (see Tables 2 and 3). In a first step, 3di subscales and items (belonging to the DSM-IV-TR algorithm plus items from the CCC, see Table C1) were selected by the senior authors and developers of the original algorithm based on their relevance with regard to DSM-5 behavior descriptions [27]. All (recoded) items were organized in a set of subscales and then organized under the DSM-5 sub-criteria (for the exact number of subscales and their names, see Table C1). In a second step, three items were selected for each subscale, to reach the highest possible internal consistency, based on Cronbach’s alpha. The same algorithm can be used for individuals of all levels of intellectual functioning under 18 years.

The 3di algorithm follows most of the DSM-5 decision-making rules for an ASD classification. Cut-offs were not based on statistical analyses, but based on consensus among authors [27]. First, cut-offs were determined for all subscales. Second, the threshold for the sub-criteria (A1, …, B4) was set on meeting the cut-off for at least one of the underlying subscales. Given the uneven distribution of subscales (and items) across the different sub-criteria, this decision-rule may have an effect on the classification; for example, the threshold for sub-criterion A1 (‘Deficits in social-emotional reciprocity’, with five subscales) is lower than the threshold for sub-criterion A3 (‘Deficits in relationships’, with three subscales) or B1 (‘Stereotyped and repetitive behaviors’, with two subscales). Third, and in line with DSM-5, a final classification of ASD requires scoring above the cut-off on all three sub-criteria of criterion A (‘Deficits in social communication and interactions’), and two out of four sub-criteria within criterion B (‘Restricted and repetitive behavior, interests, and activities’). Even though the 3di includes elaborate information on developmental history, no information on ‘Presence of behaviors in early development’ (criterion C) is included in the algorithm, and although both present and past presence of symptoms should be taken into account when rating, no explicit distinction is made between them. The 3di DSM-5 algorithm does not offer information on ASD severity.
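
The decision rules described above can be sketched as follows. This is not the authors' implementation: the function names are hypothetical, and the probability calculation at the end rests on two loud assumptions that are not in the source, namely that subscales reach their cut-off independently and with the same probability p. It only illustrates why an "at least one subscale" rule is easier to satisfy for a sub-criterion with more subscales.

```python
def subcriterion_met(subscale_scores, subscale_cutoffs):
    """A sub-criterion is met when at least one subscale reaches its cut-off."""
    return any(s >= c for s, c in zip(subscale_scores, subscale_cutoffs))


def asd_classification(met):
    """All three A sub-criteria and at least two of four B sub-criteria."""
    all_a = all(met[name] for name in ("A1", "A2", "A3"))
    enough_b = sum(met[name] for name in ("B1", "B2", "B3", "B4")) >= 2
    return all_a and enough_b


def p_at_least_one(p, n_subscales):
    """Chance that at least one of n independent subscales reaches its cut-off."""
    return 1 - (1 - p) ** n_subscales


# Subscale counts from the text: A1 has five, A3 three, B1 two subscales.
# p = 0.3 is an arbitrary illustrative value.
for name, n in (("A1", 5), ("A3", 3), ("B1", 2)):
    print(name, round(p_at_least_one(0.3, n), 2))
```

Under these assumptions, the chance of meeting a sub-criterion rises with the number of subscales behind it, which is the sense in which the A1 threshold is lower than the A3 or B1 threshold.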

Diagnostic interview for social and communication disorders (DISCO-11)

For the DSM-5 algorithm of the DISCO-11 [17], 85 items were selected based on their relevance with regards to DSM-5 sub-criteria and exemplars (see Tables 2 and 3). Item selection was done by researchers and reviewed by a panel of independent clinicians. The algorithm recodes original item codings into present or not present. Algorithm thresholds for sub-criteria were defined based on ROC curve analyses. This DSM-5 algorithm had comparable sensitivity and specificity across the different age and ability levels tested. The DISCO-11 strictly follows all DSM-5 decision-making rules for ASD classification. First, separate cut-offs for criteria A (‘Deficits in social communication and interaction’), B (‘Restricted and repetitive behavior, interests, and activities’), and C (‘Early onset’) are used. Second, all A sub-criteria and two out of four B sub-criteria have to be met to obtain a classification. As defined in DSM-5, behaviors based on current descriptions or by history are taken into account in the ‘ever’ classification of DISCO-11. A final ASD classification is only possible when all three criteria (A, B, and C) are met separately. Note that our item mapping suggested that most of the items included in criterion C are not consistent with DSM-5 (items related to development of language and pretend play, for more details, see Item mapping and Appendix D), which might have an effect on the classification. The DISCO DSM-5 algorithm does not offer any information on severity of ASD symptoms.
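
A minimal sketch of this kind of rule is given below. It is not the published DISCO-11 algorithm: the thresholds and item codes are invented (the real thresholds were derived from ROC curve analyses), treating criterion C as a single item count is a simplification, and which original codes are recoded as "present" is an assumption (here only code 0, 'marked problem'). All names are hypothetical.

```python
def recode_present(code, present_codes=(0,)):
    """Recode a 0/1/2 severity code to present / not present.

    Assumption: only code 0 ('marked problem') counts as present.
    """
    return code in present_codes


def subcriterion_met(item_codes, threshold):
    """Met when the number of present items reaches the threshold."""
    return sum(recode_present(code) for code in item_codes) >= threshold


def disco_style_classification(codes, thresholds):
    """All A sub-criteria, at least two of four B sub-criteria, and criterion C."""
    met = {name: subcriterion_met(codes[name], thresholds[name]) for name in thresholds}
    all_a = all(met[name] for name in ("A1", "A2", "A3"))
    enough_b = sum(met[name] for name in ("B1", "B2", "B3", "B4")) >= 2
    return all_a and enough_b and met["C"]


# Invented thresholds and item codes, for illustration only.
thresholds = {"A1": 2, "A2": 2, "A3": 2, "B1": 2, "B2": 2, "B3": 1, "B4": 1, "C": 1}
codes = {
    "A1": [0, 0, 2], "A2": [0, 0, 1], "A3": [0, 0, 0],
    "B1": [0, 0, 2], "B2": [2, 2, 1], "B3": [0, 2], "B4": [2, 2],
    "C": [0, 2],
}

print(disco_style_classification(codes, thresholds))

# Same profile without evidence of early onset: criterion C is no longer met.
codes_without_c = dict(codes, C=[2, 2])
print(disco_style_classification(codes_without_c, thresholds))
```

In this sketch the first profile is classified and the second is not, because criteria A, B and C each have to be met separately.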

Discussion

The first aim of this study was to establish the content validity of three diagnostic assessment instruments in relation to the DSM-5 algorithms for ASD, namely ADOS-2 [21, 22], 3di [27], and DISCO-11 [17] and the second aim was to identify potential problems with the operationalization of DSM-5 diagnostic (sub-)criteria for ASD. Our analyses showed that the three instruments do not cover all ASD symptoms to the same extent and that their diagnostic classification procedures are not always in line with the DSM-5 ASD criteria. Furthermore, the interpretation of the DSM-5 behavioral A (‘Deficits in social communication and interactions’) and B (‘Restricted and repetitive behavior, interests, and activities’) criteria is sometimes ambiguous and the other criteria (C—‘Early onset’, D—‘Significant impact on daily life functioning’, and E—‘Not better explained by other developmental diagnosis’) are not clearly defined.

The three instruments do not cover all ASD symptoms to the same extent

Differences in the nature of the instruments, their history and the development of the DSM-5 adapted algorithms can explain some of the variability in the symptoms included in the instruments’ algorithms. More specifically, an observation scale such as ADOS-2 cannot include items on developmental history, whereas these items are available in the parental interviews, but not always included in the algorithms. In addition, the likelihood of observing less frequent, yet highly salient and clinically significant RRBIs is limited during the 45-min time-window of the ADOS-2, which probably explains the under-representation of criterion B items in ADOS-2, compared to the interview instruments [33]. The absence of RRBIs in such a context should be interpreted with caution, as those behaviors might only occur under highly specific circumstances [33]. Similarly, authors state that it might be hard to capture deficits in building and maintaining relationships in a time-limited standardized observation, which could explain why this sub-criterion is (almost) absent in some ADOS-2 modules [33]. However, observation instruments provide the clinician with unique first-hand observations of the child. Whereas both parental interviews represent all sub-criteria, the ADOS-2 does not cover all of them (a finding that is in line with a previous item mapping by Huerta and colleagues [14]), which is partially due to the limitations of a time-constrained observation instrument.

The number of items included in the algorithms differed significantly across the three instruments. However, comparing the absolute number of items does not do justice to the instruments, as their items vary greatly in how broadly they are formulated. Whereas the items in ADOS-2 mostly refer to a broader area of functioning (e.g., ‘Using gestures’), the 3di consists of highly specific questions (e.g., ‘Shaking head for no’; ‘Nodding head for yes’). In this regard, the DISCO-11 takes an intermediate position (e.g., ‘Shaking or nodding head’). It is hence unsurprising that ADOS-2 consists of fewer items than 3di or DISCO-11. Furthermore, compared to DISCO-11, the 3di emphasizes specific exemplars, which is probably partially related to the development of its DSM-5 algorithm, starting from existing subscales (see Results). Consequently, the 3di provides an elaborate picture of some specific DSM-5 exemplars, while other symptoms remain unexplored (e.g., seven out of ten B4 items focus on auditory sensitivity, but no items related to indifference to pain or temperature are included). Although exemplars do not represent an exhaustive list of symptoms within a specific criterion, the distribution of items across different exemplars is important to capture a range of different symptoms.

The nature and history of the different instruments can partly explain why not all DSM-5 criteria are represented in the different instruments. More items might also be required to capture the range of impairments in some sub-criteria. Although some empirical findings indeed suggest that it might be harder to capture socio-communicative problems in a few items compared to problems related to RRBIs [36], our item mapping demonstrated that instrument-specific mechanisms also play an important role: Whereas the algorithms of the ADOS-2 (as acknowledged by the authors of ADOS-2 in the manual) and the 3di consisted of more socio-communicative items compared to RRBI symptoms, the opposite pattern was found for the DISCO-11, where somewhat more items measured RRBI symptoms than problems with social interaction and communication. Moreover, as individual items may be more characteristic of particular subgroups of individuals, including a broader range of items could improve sensitivity for these different subgroups.

The differences in how the DSM-5 criteria are represented in the different instruments, and particularly the different limitations and advantages of parental interviews and observation scales, highlight that the combination of different diagnostic instruments increases their predictive value [6, 14]. Indeed, neither observation nor parental interviews should be the sole instrument used in diagnostic decision-making. At a minimum, clinicians should be aware of the limitations of specific instruments and use additional sources of information to address these limitations. For example, the ADOS-2 does not evaluate peer interactions and hence provides no insight into peer relations. In this case, ADOS-2 information could be complemented with information from the semi-structured interviews.

The interpretation of DSM-5 behavioral criteria for ASD is sometimes ambiguous

The expert panel experienced some difficulties in assigning the items to the different (sub-)criteria, and the areas of greatest disagreement and discussion between raters are highlighted in this section. Taken together, our item mapping raised questions concerning the exact meaning of ASD symptoms as described in DSM-5, and their operationalization into concrete and measurable/observable behaviors. Within criterion A, the distinction between A1 (‘Deficits in social-emotional reciprocity’) and A3 (‘Deficits in developing, maintaining and understanding relationships’) appeared especially difficult [also see 14], as nearly all behaviors under A1 seem to be requirements for building and maintaining friendships (A3), although other reasons for deficits in A3 are possible as well. Not only does there appear to be a hierarchical relationship between A1 and A3 symptoms, but they are also quite hard to distinguish from each other, as was reflected in the number of disagreements between our item mapping and the original placement of items for these sub-criteria. For instance, solely based on the sub-criteria and exemplars it is difficult to differentiate A1 exemplars ‘not being able to maintain a reciprocal back-and-forth conversation’ and ‘a failure in the initiation or response to social behaviors’ from A3 exemplars ‘difficulties in adjusting behavior to suit various social contexts’ and ‘an absence of interest in peers’. Although some differentiation seems to be possible, individual items could equally map across these different descriptions and, therefore, map to both A1 and A3. Hence it might be difficult to unravel and separately measure these different symptoms in research.

A considerable proportion of individuals with ASD are nonverbal or minimally verbal [37, 38], and the distinction between A2 and A1 can be especially difficult in this group. For example, when a nonverbal individual does not point to share an interest, should that behavior be considered as a problem in nonverbal communicative behavior used for social interaction (A2) or as a deficit in social-emotional reciprocity (A1)? Behaviors like joint attention or sharing enjoyment are all nonverbal social behaviors. Clear guidelines are lacking on how to differentiate A1 and A2 in a population that uses nonverbal behaviors as their primary mode of communication. A1 focuses on reciprocity, regardless of the modality (verbal or nonverbal), whereas A2 covers the quantity and quality of nonverbal behaviors serving the social interaction. Diagnostic instruments should try to distinguish those aspects of (non)verbal behaviors in different items or questions.

Moreover, it sometimes appeared problematic to distinguish between symptoms related to B1 (‘Stereotyped and repetitive behaviors’), B2 (‘Insistence on sameness’) and B3 (‘Highly restricted, fixated interests’). For example, a child with an especially strong interest in a specific animation series (B3), who imitates entire conversations from that series (B1) and insists on watching the series every evening at seven o’clock (B2), could reach threshold on three B criteria based on one fixated interest that prevails in other aspects of functioning. In these cases, it remains unclear how clinicians or researchers should categorize such complex behaviors. On the one hand, it appears unfair to code one set of behaviors under multiple sub-criteria, as individuals will reach diagnostic thresholds very quickly based on one complex behavior (provided it is impairing across contexts). On the other hand, guidelines are lacking on which sub-criteria should be prioritized over others in these instances.

Taken together, mapping the instruments’ items onto DSM-5 diagnostic criteria revealed difficulties in operationalizing the criteria into clear, measurable or observable behaviors and in distinguishing between specific sub-criteria. By no means are we pleading for a checklist of concrete symptoms that have to be met. Diagnostic ASD evaluations should comprise an extensive assessment of the individual in various contexts, comparing the individual’s behaviors not only to the diagnostic criteria as described in manuals, but also taking into account expectations based on the overall level of intellectual functioning [2]. However, it appears important to reformulate and clarify some of the symptoms listed in DSM-5, such that researchers and clinicians can reach consensus about how to clearly map behaviors that are part of the ASD phenotype onto the DSM-5 sub-criteria.

Different classification procedures in the DSM-5 algorithms can lead to different algorithm outcomes based on the three instruments

The development of the algorithms differs between the instruments, both with regard to the selection of items and with regard to determining the cut-off. The DISCO-11 selected items in a fully top-down manner, including items with the highest content validity with regard to DSM-5 criteria. The ADOS-2 and 3di took another approach, integrating bottom-up (data-driven) and top-down (construct-driven) elements. Instruments also differed significantly in how the cut-off was set. For ADOS-2 and DISCO-11, cut-off scores were determined based on ROC analyses, whereas for 3di, the cut-off was based on consensus in the research team.

Psychometric properties of ADOS-2 and DISCO-11 have been reported to be good to very good [17, 21, 22], while the sensitivity and specificity of the DSM-5 algorithm of the 3di have not yet been established. To date, psychometric research for the three instruments has been based on groups with a DSM-IV-TR diagnosis of autism. Research including individuals with a clinical DSM-5 diagnosis of ASD is needed to establish psychometric properties, but to our knowledge this is largely lacking (except for two studies showing good psychometric properties of ADOS-2 in adults [39, 40]).

For criterion A (‘Social interaction and communication’) and B (‘RRBIs’), the classification procedures of 3di and DISCO-11 are in line with the decision-rules as described in DSM-5, in the sense that an ASD classification requires combined problems in criterion A and B. However, criterion C (‘Early onset’) is less well implemented in the parental interviews, and interviewers need to remain mindful about this criterion. Although the 3di contains developmental items, its DSM-5 algorithm does not include any items about developmental history. DISCO-11 includes a set of items of which several are not part of DSM-5, which could have an impact on diagnostic classification according to the instrument.

Based on the ADOS-2 algorithm, however, an ASD classification can be given without meeting all DSM-5 criteria. For example, individuals with social communication disorder who only show impairments in criterion A (‘Deficits in social communication and interaction’), but no RRBIs (criterion B), may also reach the threshold for ASD classification on ADOS-2. The ADOS-2 manual should stipulate more clearly the potential consequence of this classification rule on diagnostic outcome according to the instrument. In both academic and clinical research, especially in the USA, the ADOS-2 is widely used, typically in combination with Autism Diagnostic Interview-Revised (ADI-R; [41]), and they are often referred to as the gold standard for ASD diagnosis. However, the psychometric properties of the (combined) ADOS-2 and ADI-R, for the clinical diagnosis of ASD in the DSM-5 era, have not been studied yet.
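The contrast between the two kinds of classification rule can be made concrete with a minimal sketch. The first function encodes the DSM-5 decision rule for the behavioral criteria (all three A sub-criteria and at least two of the four B sub-criteria, as specified in DSM-5 itself). The second function is a hypothetical single-cut-off rule on a summed score; the domain names, scores and cut-off value are invented for illustration and are not taken from the ADOS-2 manual or any other instrument.

```python
def meets_dsm5_ab(a_met, b_met):
    """DSM-5 rule for the behavioral criteria.

    a_met: three booleans for sub-criteria A1-A3 (all must be met).
    b_met: four booleans for sub-criteria B1-B4 (at least two must be met).
    """
    return all(a_met) and sum(b_met) >= 2


def meets_total_score_cutoff(domain_scores, cutoff):
    """Hypothetical rule: classify when the summed score reaches the cut-off,
    regardless of which domain the points come from."""
    return sum(domain_scores.values()) >= cutoff


# Illustrative profile: marked socio-communicative problems, no RRBIs.
a_met = [True, True, True]
b_met = [False, False, False, False]
domain_scores = {"social_communication": 9, "rrbi": 0}  # invented scores
CUTOFF = 7  # invented cut-off, for illustration only

print(meets_dsm5_ab(a_met, b_met))                      # criterion B is not met
print(meets_total_score_cutoff(domain_scores, CUTOFF))  # threshold is reached
```

Under these assumptions the same profile fails the DSM-5 rule but reaches the summed-score threshold, which is the kind of discrepancy described above for individuals who show impairments in criterion A only.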

The interpretation of DSM-5 criteria C (‘Early onset’), D (‘Impact on daily life functioning’), and E (‘Not better explained by intellectual disability’), and the three levels of severity has not yet been clearly defined

Our comparative analyses also revealed that criteria C, D, and E were not (sufficiently) included in the commonly used diagnostic instruments. These criteria seem to be neglected and underspecified in the development of instruments and published work at this stage. Compared to DSM-IV-TR, the early onset criterion is less strictly defined and the presence of symptoms that may be masked in early development, but cause impairments later on, is clearly acknowledged in criterion C. However, our analyses showed that diagnostic instruments did not integrate this less strictly defined age-of-onset criterion in their algorithms. Clinicians generally use other sources of information to establish criterion C. Although the impairing effect of ASD symptoms on important areas of functioning (criterion D) is not explicitly included in the instruments’ algorithms, all instruments provide some information on the impact of characteristics that should be interpreted by the clinician, taking into account all available information. However, estimating the impairing impact can be rather difficult, certainly in individuals with more subtle, or with co-occurring problems. In addition, DSM-5 stipulates that disturbances should not be better explained by intellectual disability or global developmental delay (criterion E), but no guidance is provided on how to make this determination.

DSM-5 explicitly refers to the heterogeneous environmental modifications that are required for daily functioning, by outlining three levels of severity for socio-communicative impairments and RRBIs separately, based on the amount of support needed. Qualitative descriptions of the different levels of support are provided in DSM-5, but the operationalization seems to be more closely related to severity of ASD symptoms. However, previous research suggests that there is little overlap between different concepts related to severity and support [42]. Without a clear conceptualization and a standard method it hence seems unlikely that professionals are consistent in their classification as requiring ‘support’, ‘substantial support’ or ‘very substantial support’. Furthermore, there are no guidelines on whether characteristics with a known impact on severity or level of support, such as age, cognitive level, language ability or adaptive behavior, should be taken into account, nor on how this should be done [34, 42].

Diagnostic ASD assessment instruments and DSM-5: Conclusions and recommendations for research and clinical practice

Based on growing empirical evidence, DSM-5 has abandoned the different sub-classifications within the autism spectrum, and has stipulated two (instead of three) core domains of impairment [18, 30, 31, 43, 44, 45]. Inter-individual variability has been incorporated, at least partially, by including a range of exemplars and the option to indicate the level of required support separately for symptoms related to socio-communication problems and RRBIs. Furthermore, the co-occurrence of ASD with other disorders has now been acknowledged. In all, the new DSM has improved greatly with respect to clarifying nosology and providing more transparent descriptions of the core ASD characteristics [46, 47].

Diagnostic assessment instruments that have developed specific DSM-5 algorithms differ greatly from each other with respect to which ASD features are measured and their compatibility with the DSM-5 classification rules. It is crucial that users understand these limitations, both in terms of ASD characteristics (not) covered and in terms of the classification according to the algorithm, given the importance of these instruments in the context of academic research and clinical diagnostic assessments. Clinicians using the instruments in the context of diagnostic decision-making should take into account that, as before, diagnostic classification should never be based solely on the scores on one (or two) instruments [also see, e.g., 6], but should rely on the integration and clinical interpretation of different sources of information by a multidisciplinary team of experienced clinicians [2, 5]. Our results demonstrate which sub-criteria are (not) covered by specific diagnostic instruments, and hence highlight areas for each instrument where it would be important to collect additional information. In the context of (international) research, it should be emphasized that DSM-5 criteria are implemented differently by ADOS-2, 3di and DISCO-11. Caution is, therefore, warranted when using one or two instruments to validate a clinical diagnosis, as the exclusion of participants based on the outcome of these instruments might lead to a biased understanding.

To advance clinical practice and research, we recommend that future work be directed towards resolving some of the existing ambiguities with regard to the definition and measurement of impairment in current functioning (criterion D) and severity levels. In addition, it remains rather difficult to distinguish between sub-criteria within criteria A and B, especially in the context of very young children or nonverbal individuals. Providing more clarity could lead to a more stable and accurate classification. Note that to fully understand and situate the symptoms, it is important that professionals carefully read the full text supplementing the list of criteria.

DSM-5 is currently the most important classification manual in the context of ASD. Recently, the novel guidelines of the International Classification of Diseases (ICD-11) have been published [13], and the criteria show a strong parallel with DSM-5. However, there are also notable differences between the two manuals. In contrast to DSM-5, ICD-11 lists eight subcategories in ASD, based on co-occurring intellectual or language impairments. In addition, ICD-11 does not provide concrete exemplars and does not specify a required number or combination of symptoms. On the one hand, this might give more flexibility to clinicians, who have to judge whether an individual meets the threshold. On the other hand, this could also negatively impact the reliability and stability of diagnoses across settings and professionals. The future will tell how the differences between the two classification systems will be integrated in diagnostic assessment instruments, and how they will impact prevalence rates.