Introduction

Autism spectrum disorders (ASD) are a range of neurodevelopmental disorders characterized by deficits in social communication and the presence of restricted and repetitive patterns of behavior, interest or activity (American Psychiatric Association 2013; Levy et al. 2009). At present, there is no cure for ASD; however, significant improvement in ASD symptoms can be achieved through a range of intensive behavioral interventions (Eaves and Ho 1996; Estes et al. 2015). Substantial evidence suggests that earlier application of such interventions achieves better outcomes (Dawson et al. 2010; Eaves and Ho 1996; Zwaigenbaum et al. 2015). Thus, early diagnosis of ASD on a population basis is warranted to reduce the burden associated with the disorder among affected children and their families.

It is accepted that ASD can be reliably diagnosed before the second birthday (Dawson et al. 2010; Worley et al. 2011; Zwaigenbaum et al. 2015). Thus, the American Academy of Pediatrics recommends that all children should undergo standardized screening testing to detect autism between the ages of 18–24 months (Johnson and Myers 2007). There are multiple screening tools for ASD; some are based on parental reports, while others require direct observation and engagement by clinicians. Parent-report tools often have the advantage of being brief, inexpensive, and practical in the office setting.

One of the most commonly used ASD-screening tools worldwide is the Modified Checklist for Autism in Toddlers (M-CHAT) (Robins et al. 2001). This two-stage screening tool uses 23 yes/no items that can be completed by parents of toddlers between 16 and 30 months of age. Both the M-CHAT and its recent revision, the Modified Checklist for Autism in Toddlers, Revised, with Follow-Up (M-CHAT-R/F) (Robins et al. 2014), have demonstrated high sensitivity and specificity in detecting toddlers with ASD in the US (Kleinman et al. 2008; Robins 2008), as well as in several other countries (Mohamed et al. 2016; Nygren et al. 2012; Samadi and McConkey 2015). In addition, the M-CHAT performed better in detecting children with autism than various more general screening tools (Pinto-Martin et al. 2008; Wiggins et al. 2014). Yet, there is still insufficient evidence regarding the benefits and harms of using the M-CHAT as a universal screening tool for ASD (Robins et al. 2016; Siu et al. 2016). In particular, concerns have been raised about the fate of children with ASD who pass the M-CHAT at 18 months (false negatives) (Øien et al. 2018; Stenberg et al. 2014). Thus, further evaluation of the M-CHAT and other ASD screening tools in diverse populations is warranted.

Toddlers in Israel undergo global developmental screening (GDS) by nurses at government-funded maternal child health centers (MCHCs) (Israel Ministry of Health 2016). The GDS is performed at each visit to the MCHC (i.e. at ages 3, 6, 9, 12, 18, 24, 36, 48, and 60 months) and includes 4–9 items that assess different age-appropriate developmental skills in the gross motor, fine motor, language and communication, and emotional-social domains, which are examined through an interview with the parent or direct observation of the child. If the child shows risk on one or more of the GDS items, he/she will be invited for a follow-up examination or will be referred to a child developmental clinic for further evaluation. Since the GDS was designed to detect general developmental problems, we hypothesized that it might perform more poorly in detecting toddlers with ASD than ASD-specific screening tools such as the M-CHAT. To test this hypothesis, we compared the screening performance of the M-CHAT and the GDS in a sample of 1591 toddlers aged 18–36 months at 35 randomly selected MCHCs in the Negev.

Methods

Population

There are approximately 16,000 live births annually in the southern part of Israel, the Negev, of which ~50% are Bedouin Arabs, ~50% are Jewish, and less than 1% are immigrants of other ethnicities (The Central Bureau of Statistics of Israel 2016). The development of these newborns is routinely monitored from birth until age 6 years in 47 government-funded MCHCs distributed across the region. Attendance at these MCHCs is extremely high, with 95%–99% of all newborns attending an MCHC for developmental assessments, vaccinations, and other services during infancy (Bin Nun et al. 2010).

ASD Screening

Screening for ASD or other developmental delays (DD) was conducted between March 2015 and December 2016 in 35 of the 47 MCHCs in the Negev, where the nurses agreed to administer the M-CHAT questionnaire in addition to routinely administering the GDS. We used the original version of the M-CHAT (the 23-item yes/no version; Robins et al. 2001), because at the time of the study there was no Hebrew translation of the more recent M-CHAT-R/F version. Nurses who participated in the study attended a 1-day workshop on ASD screening and diagnosis and received specific training in M-CHAT administration. All of these nurses had previous experience in administering the GDS.

Subjects were toddlers aged 18–36 months who came for regular developmental assessments at these clinics. Both the M-CHAT and GDS screenings were administered by the nurses at the clinics because many parents, especially in the Bedouin population, have limited reading and writing skills. Items that could not be evaluated at the clinic were scored according to the parents' experience with their child. Toddlers who screened positive on the M-CHAT questionnaire were followed up by an ASD specialist nurse using the M-CHAT follow-up questionnaire (M-CHAT/F). Toddlers who screened positive on either the M-CHAT/F or the GDS were referred for further developmental evaluation at the Soroka University Medical Center (SUMC). Diagnoses of ASD or other developmental problems were determined by a child psychiatrist or child neurologist according to the DSM-5 criteria (American Psychiatric Association 2013), as described previously (Meiri et al. 2017).

Screening Evaluation

False negatives were determined 10 months after study completion by reviewing the medical records of all toddlers who screened negative on both the M-CHAT/F and the GDS and identifying those with a reported ASD or DD diagnosis. We used standard screening validity measures [i.e. sensitivity, specificity, and positive and negative predictive values (PPV and NPV)] to compare the ability of the M-CHAT/F and the GDS to detect toddlers with ASD or other developmental problems.
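The validity measures named above follow directly from the four cells obtained by cross-tabulating a screening result against the final diagnosis. The following is a minimal sketch of those standard definitions, not the analysis code used in this study; the class name, function name, and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a screening-result vs. diagnosis cross-tabulation."""
    tp: int  # screen positive, condition diagnosed
    fp: int  # screen positive, condition not diagnosed
    fn: int  # screen negative, condition diagnosed
    tn: int  # screen negative, condition not diagnosed


def validity_measures(c: ConfusionCounts) -> dict[str, float]:
    """Return the four standard screening validity measures as fractions."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # diagnosed children the screen flags
        "specificity": c.tn / (c.tn + c.fp),  # undiagnosed children the screen clears
        "ppv": c.tp / (c.tp + c.fp),          # flagged children who are diagnosed
        "npv": c.tn / (c.tn + c.fn),          # cleared children who are undiagnosed
    }


# Hypothetical counts for illustration only (not the study data).
example = ConfusionCounts(tp=8, fp=32, fn=2, tn=958)
measures = validity_measures(example)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

With a rare condition, PPV can remain low even when sensitivity and specificity are both high, because false positives drawn from the large unaffected group outnumber the true positives.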

Assessment of the GDS Tool

Currently, nurses at the MCHCs refer toddlers for further developmental evaluation based on the GDS results and their own impression of the child's development. To obtain an objective assessment of the GDS tool, we collected the full history of the GDS tests performed at 9, 12, 18, 24 and 36 months for the 50 toddlers who failed the M-CHAT/F test and for another 150 randomly selected toddlers with normal M-CHAT/F scores. We then asked eleven experienced MCHC nurses (2–30 years of work at the MCHC) to review these data and give a referral decision for each of these toddlers. Importantly, the nurses had no other information about these toddlers except for their sex and age. We used Fleiss' kappa statistics to assess inter-rater agreement between these eleven nurses. The developmental skills that are evaluated by the GDS at 9, 12, 18, 24 and 36 months are described in the supplementary materials.
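For readers unfamiliar with the statistic, Fleiss' kappa compares the observed agreement among a fixed number of raters with the agreement expected by chance. The sketch below implements the textbook formula under the assumption that every toddler is rated by the same number of nurses and that each rating is one of two categories (refer, do not refer); the function name and the example ratings are hypothetical and are not the study data.

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for a subjects-by-categories table of rating counts.

    ratings[i][j] is the number of raters who assigned subject i to
    category j. Every row must sum to the same number of raters.
    The statistic is undefined (division by zero) when all ratings
    fall in a single category.
    """
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every subject must be rated by the same number of raters")
    n_categories = len(ratings[0])

    # Share of all ratings that fell into each category.
    category_share = [
        sum(row[j] for row in ratings) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Agreement among rater pairs for each subject.
    per_subject = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(per_subject) / n_subjects
    expected = sum(share * share for share in category_share)
    return (observed - expected) / (1 - expected)


# Hypothetical table: four toddlers, eleven nurses, columns = (refer, do not refer).
example_ratings = [[11, 0], [0, 11], [6, 5], [1, 10]]
example_kappa = fleiss_kappa(example_ratings)
print(f"Fleiss' kappa = {example_kappa:.3f}")
```

A value of 1 indicates complete agreement, and values near 0 indicate agreement no better than chance.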

Results

M-CHAT/F—GDS Comparison

A total of 1,591 toddlers between 18 and 36 months of age (mean = 21.30 ± 3.45 months) were screened by both the GDS and the M-CHAT/F at the selected MCHCs in this study (Table 1). There was a small but significant overrepresentation of Bedouins and males in this sample (60% and 54% respectively; p < 0.001).

Table 1 Study population

Figure 1 shows the screening results of this study. Overall, 24 toddlers were detected by both the M-CHAT/F and the GDS as needing further developmental evaluation (Kappa = 0.42, p < 0.001). Another 26 and 35 toddlers were detected by only the M-CHAT/F or only the GDS, respectively. Of these 84 toddlers, seven received a diagnosis of ASD, 30 received a diagnosis of other forms of developmental delay (DD) (e.g. motor delay or speech and language delay), 11 had not completed their diagnostic process at the time of this analysis, and four were lost to follow-up. Notably, the M-CHAT/F was more sensitive than the GDS in detecting toddlers with ASD (sensitivity of 70.0% vs. 50.0% respectively), and slightly more specific (specificity of 98.2% vs. 96.6% respectively) (Table 2). These differences translated into a 2.3-fold difference between the PPV of these two screening tools (20.0% vs. 8.6% for the M-CHAT/F and GDS respectively). Notably, all five toddlers with ASD who screened positive on the GDS were also positive on the M-CHAT/F; thus, combining the results of the GDS and the M-CHAT/F did not produce better screening results than the M-CHAT/F alone.
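The agreement between the two screens can be reproduced approximately from the counts given above. The sketch below assumes that the reported statistic is Cohen's kappa for two binary ratings and that every toddler not flagged by either tool counts as negative on both; the function and variable names are hypothetical.

```python
def cohen_kappa_2x2(both_pos: int, only_a: int, only_b: int, both_neg: int) -> float:
    """Cohen's kappa for two binary screens summarized as a 2x2 table."""
    n = both_pos + only_a + only_b + both_neg
    observed = (both_pos + both_neg) / n            # share of concordant results
    a_pos = (both_pos + only_a) / n                 # positive rate of screen A
    b_pos = (both_pos + only_b) / n                 # positive rate of screen B
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)  # chance agreement
    return (observed - expected) / (1 - expected)


total_screened = 1591
both, mchat_only, gds_only = 24, 26, 35
# Assumption: all remaining toddlers were negative on both screens.
both_negative = total_screened - (both + mchat_only + gds_only)

kappa = cohen_kappa_2x2(both, mchat_only, gds_only, both_negative)
print(f"kappa = {kappa:.2f}")
```

Under these assumptions the result is roughly 0.42, in line with the agreement reported in the text. Because the large majority of toddlers are negative on both screens, raw agreement is above 96%, while the chance-corrected value is only moderate.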

Fig. 1

A flow chart of the screening results of this study. Participants were assigned diagnoses of autism spectrum disorder (ASD), other types of developmental delay (DD), or normal development (ND). Toddlers who did not complete their diagnostic evaluation by the time of this study were considered as lost to follow-up (LF)

Table 2 Screening efficacy of the M-CHAT/F and GDS in detecting toddlers with ASD

We also assessed the efficacy of these two screening tools in detecting toddlers with other forms of DD (Table 3). The M-CHAT/F and the GDS had similar sensitivity and specificity in detecting toddlers with DD (Sensitivity = 63.3%, Specificity = 97.5–99%). However, the combination of these two tools resulted in a marked increase in the sensitivity of the screening (Sensitivity = 93.7%) without changing its specificity (Specificity = 98%).
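Combining two screens with an "either positive" rule can only keep or raise sensitivity and can only keep or lower specificity, relative to each screen alone. The sketch below illustrates this with a small hypothetical data set; it assumes the combined screen is defined as positive when at least one tool is positive, and none of the values are taken from the study.

```python
def sensitivity_specificity(screen: list[bool], condition: list[bool]) -> tuple[float, float]:
    """Sensitivity and specificity of a binary screen against a binary diagnosis."""
    tp = sum(s and c for s, c in zip(screen, condition))
    tn = sum((not s) and (not c) for s, c in zip(screen, condition))
    n_condition = sum(condition)
    n_healthy = len(condition) - n_condition
    return tp / n_condition, tn / n_healthy


# Hypothetical toddlers: four with DD followed by six without.
has_dd = [True, True, True, True, False, False, False, False, False, False]
tool_a = [True, True, False, False, False, False, False, False, False, True]
tool_b = [False, True, True, False, False, False, False, False, False, False]

# Combined screen: positive if at least one tool is positive.
either = [a or b for a, b in zip(tool_a, tool_b)]

for label, screen in (("tool A", tool_a), ("tool B", tool_b), ("either", either)):
    sens, spec = sensitivity_specificity(screen, has_dd)
    print(f"{label}: sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```

The gain in sensitivity is largest when the two tools miss different children, which is the pattern the DD results above suggest.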

Table 3 Screening efficacy of the M-CHAT/F and GDS in detecting toddlers with non-ASD developmental delay (DD)

GDS Results Evaluation

Results of the GDS evaluation by a sample of eleven MCHC nurses are displayed in Fig. 2. There was moderate inter-rater agreement between nurses regarding the referral decision for toddlers with ASD or other DD (Fleiss' kappa = 0.435), with only four of these toddlers (13%) referred by all eleven nurses (Fig. 2a). There was better consensus among the nurses regarding the 170 typically developing toddlers in this sample (Fleiss' kappa = 0.721) (Fig. 2b). Interestingly, there were four typically developing toddlers whom all nurses judged to need further developmental evaluation, and another five toddlers who were referred by 10 of the 11 nurses. Further examination of the factors associated with the nurses' referral decisions revealed a significant association with the number of failed GDS items (Spearman r = 0.65; p < 0.001), with ≥ 5 failed items serving as the referral cutoff for the majority (≥ 6) of the nurses (Fig. 2c). Notably, seven toddlers who did not fail any of the GDS items were still referred for further evaluation by a few nurses. In addition, one toddler who failed only two GDS items was referred for further developmental evaluation by seven nurses. In contrast, one toddler who failed five GDS items was referred for further developmental evaluation by only one nurse. Using ≥ 5 failed GDS items as a referral cutoff yielded 63% sensitivity and 92% specificity in identifying toddlers with ASD or other DD. The nurses' referral decisions were not associated with the age, gender or race of the toddlers.

Fig. 2

Referral decisions of 11 nurses based on blinded GDS results. Gray and white boxes indicate referral and no-referral decisions, respectively. a Referral decisions for the 30 toddlers with ASD or other developmental problems. b Referral decisions for the 170 toddlers with normal development. c The number of referring nurses is plotted vs. the number of failed GDS items. Gray circles are for typically developing toddlers, and empty black circles are for toddlers diagnosed with ASD/DD. Circle sizes are proportional to the number of toddlers in each group. The vertical dashed line indicates a putative referral cutoff of five failed GDS items, and the horizontal dashed line indicates the median number of referring nurses

Discussion

The purpose of this study was to compare the screening efficacy of the GDS vs. the M-CHAT/F in detecting ASD or other DD in a sample of 1591 toddlers from southern Israel. The prevalence of ASD in our sample (0.68%) was slightly higher than the prevalence of ASD in Israel reported in two other recent studies (Davidovitch et al. 2013; Raz et al. 2014). This difference in ASD prevalence can be mostly attributed to the active screening for ASD with both the M-CHAT/F and the GDS that was used in this study. Such active screening for developmental problems is particularly relevant to the Bedouin population, which has traditionally been characterized by a low prevalence of ASD (Mahajnah et al. 2015; Meiri et al. 2017). Thus, incorporating routine, standardized screening for ASD in this population will help reduce existing ethnic disparities in ASD diagnosis and possibly also in ASD treatments, as implied by other studies (Herlihy et al. 2014).

Notably, our ability to review the medical records of all 1,591 toddlers in this study and consequently assess the false negative rates of both the M-CHAT/F and the GDS is a major strength of this study. The relatively high rates (up to 30%) of false negatives in this study are consistent with the reports of Stenberg et al. (2014) and Øien et al. (2018), and are likely due to children with milder symptoms of ASD that are hardly noticeable at ages 18–36 months. While somewhat better performance could be achieved if the more recent version of the M-CHAT, the M-CHAT-R/F, were used (Robins et al. 2014), additional false negatives are also expected at later ages. Nevertheless, the primary goal of screening tools for ASD is to detect children with the disorder as early as possible. In this regard, the better performance of the M-CHAT/F over the GDS in detecting ASD at these ages suggests that its implementation at the MCHC clinics could help reduce the age of ASD diagnosis in this population.

The better performance of the M-CHAT/F in detecting ASD cases compared to the GDS is consistent with the results of other studies demonstrating the value of the M-CHAT/F in detecting toddlers with ASD (Chlebowski et al. 2013; Toh et al. 2017). On the other hand, the GDS is applicable to a wider age range and is therefore more flexible. Thus, toddlers tested by the GDS might be given several opportunities to achieve certain milestones before failure is determined. Given the differences between the GDS and the M-CHAT/F, a combination of these two tests in toddlers' developmental screening might be optimal (Kamio et al. 2014; Nygren et al. 2012). Our results suggest that screening toddlers with both the M-CHAT/F and the GDS will have efficacy similar to that of the M-CHAT/F alone in detecting ASD cases, and will improve the detection of toddlers with other types of DD compared to applying either the M-CHAT/F or the GDS alone.

The GDS test is designed to identify various developmental issues. However, the decision whether to refer toddlers for further developmental assessment currently relies on the examiner's judgement and is therefore variable (Mendonca et al. 2016). The moderate inter-rater agreement between the eleven nurses in our study indicates that nurses often disagree regarding the referral of the same toddler. One possible source of this variability is that failure or achievement of specific milestones is sometimes reported by parents without the ability to test it in the clinic (e.g. toddlers who refuse to cooperate). This may result in cases such as the toddler in our sample who failed five items and still was not referred for further developmental evaluation by the majority of the nurses. In addition, when conducting the GDS test, some nurses record the parents' concerns regarding the development of their child. This may be the reason for the few toddlers who apparently did not fail any of the items but were still referred for further evaluation by some of the nurses. Another factor that may affect the referral decision of the nurses is the cultural or ethnic suitability of developmental milestones, as suggested by Magalhães et al. (2015). Although we did not observe significant ethnic differences in the nurses' referral decisions in our study, such ethnic disparities should be considered in the development of future developmental tests. These results emphasize the need for standardized referral guidelines for the GDS along with complementary training of the nurses at the MCHCs.

We noticed that an important factor in the referral decision of the nurses in this study was the number of failed GDS items, with failure in five items or more serving as a referral cutoff for the majority of the nurses. Interestingly, a similar cutoff was used for both toddlers with developmental problems and toddlers with normal development, without significant differences between the two groups. The small size of our sample did not allow us to examine whether any of the GDS items, or specific combinations of them, are better predictors of ASD or other developmental problems. Results of such an analysis would help to develop standardized referral guidelines for the GDS, which would reduce the marked variability in nurses' referrals observed in this study.

Our study has several notable limitations. (1) Despite the large initial sample size of 1591 toddlers, the low prevalence of ASD and DD in the population resulted in a small number of cases, which limits our ability to obtain a stable assessment of the screening sensitivity. (2) Fifteen of the 50 toddlers (30%) who were positive on the M-CHAT/F did not complete their diagnosis. This relatively large fraction lost to follow-up may result in underestimation of the sensitivity of this screening tool. We are currently investigating the factors that are associated with referral compliance and time-to-diagnosis of toddlers who have been referred to SUMC with suspicion of ASD. (3) The referral assessment by the eleven nurses was done retrospectively without a direct examination of the children by these nurses. It is possible that better inter-rater agreement between nurses would have been achieved if this evaluation had been done in a prospective manner, with all nurses actually observing these toddlers at the clinic.

Conclusions

Our findings indicate that using the M-CHAT together with the GDS in routine developmental screening at the MCHCs would improve detection of toddlers with ASD and other DD in this population.