1 Introduction

Autism spectrum disorders (ASDs) encompass neurodevelopmental conditions characterized by sociocommunicative impairments and presence of restricted and repetitive patterns of behavior [1]. Although ASDs occur on a spectrum from mild to severe, most individuals require lifelong support in some form (familial, community, taxpayer, health systems, etc.), making research in this population a high public health priority [2]. Despite a large number of neuroimaging studies, no distinct neural signature for ASDs has been established, likely because no single etiology could account for all variants of the disorder [3]. Moreover, many additional factors including age, comorbidities, and treatment history affect neural development; thus, neural signatures for ‘classifying’ ASDs at the individual level cannot be easily identified. Although findings of atypical brain functioning in ASDs have not fully converged, most studies using functional magnetic resonance imaging (fMRI) have found that children and adults with ASDs show atypical functional connectivity (FC) between many brain regions, both within and between neural networks [4]. FMRI detects low-frequency fluctuations in the blood oxygen level-dependent (BOLD) signal during task performance or in a resting state. Correlations between BOLD time series in different brain regions are considered to reflect FC [5].

Machine learning (ML) classification methods promise to advance understanding of how brain functioning is affected in ASDs at the group level, since they are data-driven approaches that do not require specific hypotheses and can tolerate the high dimensionality of fMRI datasets. Indeed, FC variables have been used in previous ML studies for diagnostic classification of ASD [6,7,8]. A subsection of these studies (e.g., [9,10,11]) implemented random forest (RF) [12], a nonparametric ensemble ML method that constructs numerous decision trees (classification and regression trees) and makes final predictions (e.g., ASD vs. non-ASD) based on the class majority votes or mean prediction from all trees. FC variables can be used to build classifiers with RF. Such data-driven approaches can identify atypical brain patterns in ASDs even when specific hypotheses may be unwarranted (given inconsistent evidence in the literature, e.g., [13].

Previous ML studies of ASDs have mostly included highly heterogeneous ASD samples with respect to the range of symptom severities, gender composition, medication status, and other variables. It is currently unknown how sample heterogeneity affects the performance of ML classifiers of ASDs. This restricts the inferences that can be drawn from previous research. Heterogeneity is a critical factor that has hindered progress in understanding atypical brain organization in ASDs in general [14]. For example, although about 20% of individuals with ASDs are female, the impact of gender on atypical brain connectivity in ASDs is not fully understood [15]. The limited literature comparing FC in males and females with ASDs suggests sex differences [15] that are network dependent [16, 17], presumably due to divergent neurodevelopmental trajectories [18]. Secondly, studies often include ASD participants with mild symptom severity, as these may comply more easily with constraints and instruction during rs-fMRI scans. However, since atypical connectivity may be more distinct in ASD individuals with greater symptom severity, inclusion of mild cases may impede diagnostic classification.

Our study aimed to examine the impact of ASD sample heterogeneity on ML diagnostic classification, with respect to both accuracy and selection of informative features. Many variables contribute to heterogeneity in ASDs (e.g., treatment history, language abilities, psychiatric comorbidities). However, given limitations of publicly available large samples, we focused on symptom severity and self/caregiver-reported gender. This focus allowed us to operationalize two factors of heterogeneity that likely affect functional connectivity. We gathered four samples of children and adolescents between the ages of six and eighteen, with varying levels of heterogeneity in gender composition and autism symptom severity, and used resting-state fMRI data to construct diagnostic classifiers for each. We hypothesized that classification accuracy would increase with greater sample homogeneity and that most important FC features would differ across samples.

2 Methods

2.1 Datasets and participants

Children and adolescents between the ages of six and eighteen with ASDs and typically developing (TD) same-aged peers were selected from an in-house sample collected at San Diego State University (SDSU) and from publicly available multisite databases, the autism brain imaging data exchange (ABIDE) I and II [19, 20]. The scanning sites from ABIDE included in the current study were SDSU (in-house, acquired in the period between 2012 and 2019), the Kennedy Krieger Institute (KKI), New York University (NYU), Georgetown University (GU), University of California at Los Angeles (UCLA), University of Utah School of Medicine (USM), and Oregon Health and Science University (OHSU). Supplementary Figures 1–4 illustrate the number of participants from each scanning site included in each sample. Although the in-house dataset is partially available in the ABIDE database, additional scans (unavailable in ABIDE) were included. For in-house data, ASD diagnostic status was established based on the DSM-5 criteria [1], supported by autism diagnostic observation schedule (ADOS) [21] or ADOS-2 [22] administered by research-reliable clinicians, the autism diagnostic interview-revised [23] administered to caregivers, and expert clinical judgment. Diagnostic status for ASD participants in the current study was adopted directly from the ABIDE database (see Supplementary Methods for details). Inclusion in the TD group required absence of ASD or any developmental disorder apart from ADHD. Other specific exclusion criteria used for selecting TD participants (e.g., absence of neurological or psychiatric history or absence of family history of ASDs) varied across ABIDE sites. Due to known effects of eye status on FC [24], only resting-state fMRI datasets acquired with participants’ eyes open were included. As even minute amounts of head motion are known to impact FC estimates [25], strict head motion criteria were adopted. Only participants with ≥ 80% of volumes retained after censoring volumes with framewise displacement ≥ 0.5 mm and root-mean-square displacement (RMSD, a summary measure of head motion throughout the scan) < 0.2 mm were included. Further, only ABIDE sites contributing ≥ 10 participants meeting these head motion inclusion criteria were included; 306 participants with ASD and 350 TD participants were included. The fMRI and clinical data shared in the ABIDE database were de-identified from protected health information. Collection of all in-house data was approved by the Institutional Review Boards at San Diego State University and the University of California, San Diego, and all participants provided written consent to participate and also to have their data included in the ABIDE databases. For ABIDE data, prior to data contribution, sites were required to confirm that their local Institutional Review Board or ethics committee had approved both the initial data collection and the sharing of fully de-identified datasets.

Four classification samples (each including 200 ASD and 200 TD participants) were created by constricting ASD sample heterogeneity in gender and ADOS or ADOS-2 total scores (depending on the edition of which was ADOS administered). The four samples were as follows: sample 1 (‘full heterogeneity’): participants with a full range (unrestricted) of ADOS Total scores, gender unrestricted; sample 2 (‘reduced gender heterogeneity’): males only, all ADOS Total scores included; sample 3 (‘reduced ASD-symptom heterogeneity’): gender unrestricted with ADOS Total scores ≥ 10; and sample 4 (‘low heterogeneity’): males only with ADOS scores ≥ 10 (see Supplemental Methods for details on ADOS Total scores).

Given the overall small number of female participants in ABIDE, corresponding to the generally lower prevalence of ASDs in females [26], female-only samples could not be included. Tables 1, 2, 3, and 4 present the demographic characteristics of training and validation datasets for samples 1–4. In each sample, ASD and TD participants were matched, at the group level, on head motion, handedness, and age. We also matched groups on gender (where appropriate) and performance intelligence quotient (PIQ) to the extent possible given sample limitations. Matching was carried out using an algorithm developed by coauthor A.J. [27], which determines the optimal overall multivariate matching solution for two groups of participants by using proximity or distance-based metrics across multiple variables (here: age, motion, PIQ, and in samples 1 and 3, gender). Where PIQ was not available from ABIDE data (sample 1: n = 58, sample 2: n = 58, sample 3: n = 67, sample 4: n = 66), full-scale intelligent quotient (FSIQ) was used for matching purposes. In a proportion of 80:20, datasets were selected randomly from each site for the inclusion in the training model and validation models (160 ASD participants for training model and 40 ASD participants for the validation model). ASD participants included in the training and validation models were separately matched to same-sized TD groups for training and validation using the matching algorithm described above. Sample overlap for the four comparisons is depicted using Venn diagrams in Supplementary Figure 4.

Table 1 Participant demographics for sample 1 (‘full heterogeneity’: all gender participants with unrestricted ADOS Total scores)
Table 2 Participant demographics for sample 2 (‘reduced gender heterogeneity’: males only with unrestricted ADOS Total scores)
Table 3 Participant demographics for sample 3 (‘reduced ASD-related heterogeneity’: all genders and ADOS Total scores ≥ 10)
Table 4 Participant demographics for sample 4 (‘low heterogeneity’: males only with ADOS Total scores ≥ 10)

2.2 Data acquisition and preprocessing

For each participant, anatomical MRI and resting-state fMRI scans were collected. Acquisition parameters for each included scanning site are detailed in Supplementary Tables 1 and 2. Resting-state fMRI images were preprocessed using the Analysis of Functional NeuroImages, AFNI, [28] (afni.nimh.nih.gov) and FSL 5.0 [29] (fmrib.ox.ac.uk/fsl) software packages. Volumes at the beginning of each scan were discarded for T1 equilibration as needed (see Supplementary Table 3). In-house resting-state fMRI data were field-map and slice-time corrected. All data were motion-corrected and resampled to MNI 3 mm isotropic standard space using normalization tools from FSL: FLIRT (functional to anatomical, six degrees of freedom) and FNIRT (anatomical to standard, using sinc interpolation). Images were spatially smoothed to a global full width at half maximum of 6 mm and temporally smoothed using a 0.008 < f < 0.08 Hz band-pass filter. Subject-level regression of eight nuisance variables and their first derivatives (six rigid-body motion parameters estimates from motion correction, mean white-matter signal, and ventricular cerebrospinal fluid obtained via FSL’s FAST) was performed for denoising. Global signal regression was additionally performed, as recommended for multisite data [30]. All seventeen nuisance regressors were band-pass filtered by the same second-order Butterworth filter (0.008 < f < 0.08 Hz) as the functional time series [31]. Single volumes with framewise displacement > 0.5 mm and two immediately following volumes were censored, and time series segments with fewer than  10 contiguous volumes after censoring were also removed. Preprocessed images were visually inspected for artifacts and suitable brain coverage.

2.3 Regions of interest and connectivity matrices

Mean time series were extracted from 237 cortical, subcortical, and cerebellar regions of interest (ROIs) and used to create a 237 × 237 Fisher-transformed Pearson correlation connectivity matrix for each participant. This final set of ROIs included 214 of 333 ROIs from the cortical atlas by Gordon et al. [32] and all 14 subcortical, as well as 9 of 26 cerebellar ROIs from the Harvard–Oxford subcortical [33] and cerebellar [34] atlases (combining several small ROIs). The following procedure was used to determine inclusion of each ROI: First, we identified all voxels of the brain which included BOLD signal in > 95% of participants. Next, we calculated the percentage of these included voxels in each ROI. Only ROIs for which > 95% of voxels contained true BOLD signal in > 95% of participants were included; 119 cortical and 17 cerebellar ROIs did not meet these criteria and were excluded. The connectivity matrices were calculated separately for each participant in the training and validation set, with a size of 320 × 27,966 and 80 × 27,966, respectively.

2.4 Machine learning algorithm

Random forest (RF) is an ensemble learning classifier that builds a set of classification trees using a training dataset and then classifies the validation dataset by the majority vote of the predictors. RF is superior to other ensemble methods, such as bagging and boosting, in terms of accuracy, computational time, robustness against overfitting, and user interface to choose tree size [12]. It has been suggested that conditional random forest (CRF), which mitigates problems tied to correlated predictor variables [35] as present in fMRI, performs better in feature reduction, whereas RF is preferable for classification [11]. Therefore, feature selection from the high-dimensional FC matrix (320 × 27,966) was carried out with CRF [11], and the classification model was built with RF, using packages from R [36].

In our study, using the RF algorithm the training data were obtained from a bootstrap sample of the data where approximately 2/3 of the data was included in the training dataset and the remaining 1/3, known as out-of-bag (OOB) data, was used to test the error of the predictions for each tree. The RF algorithm was initialized with the number of trees grown and the number of variables used for the split at each node. An un-pruned tree from each bootstrap sample was grown such that at each node a number of predictors were randomly selected as a subset of predictor variables, and the best split from among those variables was chosen [12]. The splitting algorithm used in RF is the GINI index criterion. The OOB data were predicted and corresponding error calculated based on the outcome of the predictions of the ensemble of trees, after all the trees were grown [37].

The GINI measure can identify the influential predictor variables in random forest, but the influence of correlated and uncorrelated predictor variables is better identified using the conditional permutation importance [38]. We used CRF dimensional reduction algorithm [11] with 2001 trees, to calculate the conditional permutation importance (CPI). The most informative features based on sorted positive values of CPI were selected. These most informative features were then fed into CRF dimensional reduction method to find the next set of significant features. We repeated this procedure until the feature set was reduced to 2. The optimal feature set for the model was identified based on the lowest OOB error in the training dataset. We evaluated the performance of the CRF model with optimal features in the test set using the RF classifier with a tree number of 20,000. The number of trees for the CRF and RF was chosen based on stability of results in both CRF and RF. The number of variables at each node was optimized as the square root of overall variables. The performance of the model using the test set was evaluated using the metrics such as accuracy, sensitivity, and specificity. The accuracy depicts the ability of the model to differentiate the ASD and TD correctly and is given by

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$

The sensitivity shows the ability of the model to determine the ASD correctly and calculated by

$${\text{Accuracy}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}$$

The specificity represents the ability of the model to determine the TD correctly and estimated by

$${\text{Accuracy}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FN}}}}$$

where TP is the true positive (the number of participants correctly identified as ASD), FP is the false positive (the number of participants incorrectly identified as ASD), TN is the true negative (the number of participants correctly identified as TD), and FN is the false negative (the number of participants incorrectly identified as healthy). The informative brain networks were identified from the features that helped to achieve peak classification accuracy.

3 Results

The performance of the classifiers over a range of feature sets, using the training data, is shown in Fig. 1a–d. Feature sets > 1024 features failed to produce higher accuracies and are not shown. The feature set achieving the lowest OOB error rate for the training model was selected, and its performance was tested in the validation dataset; classification results for the validation test are summarized in Table 5. This table indicates the influence of ASD-symptom severity and gender composition on classification accuracy, sensitivity, specificity, and number of features contributing to the best classification performance in the validation samples. Peak classification accuracies in the training models are also included. Reduced sample heterogeneity was overall associated with increased classification accuracy, sensitivity, and specificity. Figure 2a–d shows for each of samples 1–4 the proportion of ROIs included in the classifier that achieved peak classification accuracy from each of the brain networks included in the present study (classified according to atlases described in Sect. 2.3). For each sample, the specific neural networks that contributed the highest proportion of ROIs included in the diagnostic classifier are described in the sections below. Figure 3a–d shows, for each of samples 1–4, the percentage of ROIs participating in most informative connectivities per brain network.

Fig. 1
figure 1

RF OOB error rate, sensitivity and specificity after each dimension reduction for the training sample. Panels ad correspond to samples 1–4, respectively

Table 5 Peak classification accuracy, sensitivity, specificity, number of features and top networks, and connections for samples 1–4
Fig. 2
figure 2

Pie charts show the brain connections that helped to achieve peak accuracy in each of the four sample sets, separated by network. a Full heterogeneity, b reduced gender heterogeneity, c reduced ASD-symptom heterogeneity, d low heterogeneity

Fig. 3
figure 3

Heat maps show the number of connectivity features included in the diagnostic classifier between ROIs classified into brain networks. Panels ad correspond to samples 1–4

3.1 Sample 1: Full heterogeneity (male and female participants with unrestricted ADOS scores)

OOB error rates of the RF classifier on the training data for the different feature sets are shown in Fig. 1a. A maximum training accuracy of 83.7% (16.2% of OOB error) was achieved with the training model using 72 FC features. For validation sample 1, we obtained a classification test accuracy, sensitivity, and specificity of 62.5%, 60%, and 65%, respectively. Most informative connectivities for the classifier with peak accuracy predominantly involved ROIs in cingulo-opercular task control (COTC), visual, dorsal attention (DA), and somatosensory motor hand (SMH) networks, which together accounted for > 50% of ROI participations (Fig. 2a).

3.2 Sample 2: Reduced gender heterogeneity (males only with unrestricted ADOS scores)

Figure 1b shows OOB error rate, sensitivity, and specificity for the sample 2 training data. The training model discriminated ASD from TD participants with a peak accuracy of 86.9% (13.1% of OOB error) using 143 FC features. The validation set test accuracy, sensitivity, and specificity of the classifier were all 65%. Prominent among the informative regions were ROIs within COTC, DA, visual, and fronto-parietal task control (FPTC) networks, which together accounted for  > 50% of ROI participations. Figure 3b shows that the highest number of informative classification features (6 features) was found for FC between ROIs of COTC and DA networks.

3.3 Sample 3: Reduced ASD-symptom heterogeneity (all genders, ADOS scores ≥ 10)

The performance of the RF classifier using the training dataset of male and female participants with ADOS Total scores ≥ 10 is shown in Fig. 1c. For this sample, the training model achieved a peak accuracy of 87.2% (12.8% of OOB error) with 324 FC features. In the validation sample, accuracy, sensitivity, and specificity of classification were 70%, 67.5%, and 72.5%, respectively. Many informative FC features involved ROIs within the COTC, DA, visual, ventral attention (VA) and SMH networks, together accounting for 57.8% of ROI participations. Figure 3c shows that connectivities within the COTC (11 informative features) were among the most informative variables for diagnostic classification in this sample.

3.4 Sample 4: Low heterogeneity (males only with ADOS scores ≥ 10)

Training results for sample 4 are shown in Fig. 1d. The performance of the training model improved for the classification of male-only ASD participants with higher symptom severity scores from TD individuals, compared to other samples, with an accuracy of 89.4% (10.6% of OOB error) using 319 features. We also achieved the highest validation test accuracy of 73.7% in this sample. Sensitivity and specificity for this validation model were 75% and 72.5%, respectively. Features including ROIs from COTC, visual, SMH, default mode network (DMN) and DA contributed most to the classifier, accounting for a total of 56.8% ROI participations (Fig. 2d). Connections between ROIs in the COTC and DMN networks and between subcortical and COTC regions were most frequent in the classifier (Fig. 3d).

4 Discussion

4.1 Relationships between ASD sample heterogeneity and diagnostic accuracy

The current study sought to examine how ASD sample heterogeneity with respect to gender and current autism symptom severity impacts performance of ML classifiers built using FC features from resting-state fMRI data. Overall, we found that ML diagnostic classifiers and their accuracy were sensitive to sample heterogeneity. Specifically, we found that FC features maximally distinguishing between ASD and TD groups differed depending on sample composition. Classification accuracy for the validation test in sample 1 (with the greatest heterogeneity in gender composition and ASD-symptom severity) was lowest (62.5%) compared to all other sample sets. With more restricted sample heterogeneity (both in gender composition and in symptom severity), classification accuracy, sensitivity, and specificity increased.

The optimal number of FC features achieving peak accuracy was much higher in the more homogeneous sample 4 (319 features), compared to sample 1 with full heterogeneity (72 features). This was unexpected and is not readily interpretable. One possibility is that the number of FC features that can be used to discriminate between ASD and TD youth reflects commonalities in brain function among children and adolescents in the ASD group that are not shared by TD peers. Thus, an increased number of features could suggest that homogeneity of gender and symptom severity within the ASD sample is associated with greater homogeneity in atypical brain function, corresponding to a greater number of features informing diagnostic classification. Conversely, our results suggest that as sample heterogeneity increases, brain connectivity varies more within the ASD cohort and only a smaller set of shared connectivity features contributes to peak classification accuracy. Our results suggest that observed patterns of FC differences between ASD and TD cohorts are highly sensitive to cohort composition. This is consistent with recent reports of low replication in resting-state FC across ABIDE sites—and therefore across ASD cohorts [13, 39].

In the present study, we achieved the highest classification accuracy in external validation (73.75%) for sample 4, which included only males with ADOS Total scores ≥ 10. This classifier’s accuracy was higher compared to most previous fMRI ML studies of ASD [11, 40, 41] that applied classifiers to completely new external validation data in order to rule out a ‘peeking’ problem (which occurs when feature selection includes data that are also used for cross-validation) [42]; however, some studies have reported slightly higher classification accuracies [43]. Although achieving higher accuracy in classification is important for understanding how brain function may be different in ASDs, the impact of sample heterogeneity implies that classification accuracy may not be the sole determinant of a classifier’s usefulness. A lower classification accuracy obtained in a study including a highly heterogeneous sample that is representative of the diverse autism population at large may be informative in different ways, by pinpointing brain features that tend to be atypical across many (but not all) variants of ASDs. Future research using larger sample sizes that are more heterogeneous, and research aiming to identify subtypes of ASDs from representative samples, including sufficient samples of females and individuals with lower cognitive abilities, is warranted.

We observed that the accuracy of classification decreased as the dimensionality of the feature set was reduced to below 20 (Fig. 1). This suggests that information that is critical for diagnostic classification is lost when models are oversimplified, which may hinder efforts toward discovery of more robust FC biomarkers of autism [44]. In this context, it is remarkable that peak accuracy for most homogeneous sample 4 was achieved for a comparatively high number of over 300 features, which indicates that greater homogeneity at the behavioral and demographic levels may still be associated with heterogeneity at the neural level. Although classifiers with high diagnostic accuracy are important, models containing very large numbers of features may be difficult to interpret. Trade-offs between maximizing prediction accuracy and conceptual interpretation of findings must therefore be considered. In view of the common terminology of ‘diagnostic classification,’ it is important to remember that ML serves as a research tool for improved understanding of atypical brain functioning in ASDs, whereas applications as a truly diagnostic tool are currently (and possibly in principle) unrealistic for a clinically defined disorder.

4.2 Regions and networks most informative in diagnostic classification

The COTC network contributed the highest number of features to classifiers achieving peak accuracy, across all four samples. This informative role in classification was similar to some recent diagnostic prediction studies of ASD [10, 43]. However, representation of ROIs from other networks varied by sample set. We observed a higher number of informative features involving the COTC and DMN in samples including individuals with ADOS scores indicative of moderate to severe levels of symptoms (samples 3 and 4), compared to samples of participants with a wider range of symptom severity, including individuals with lower ADOS scores (samples 1 and 2). This finding may be related to a significant positive correlation between the ADOS and FC between COTC and DMN observed in a previous study [9]. Relatively high numbers of features from the visual network also contributed to classifiers across all samples. This finding adds to a growing body of research implicating atypical functioning of the visual regions in ASDs [45,46,47,48].

Finally, we found that DA and VA network ROIs contributed more features in sample sets 3 and 4, suggesting disruption of connectivity involving attention networks that may help differentiate more severely affected youth with ASDs from TD peers. These findings are also consistent with earlier reports that connectivity in DA and VA may help to differentiate between cohorts of ASD and TD individuals [49].

4.3 Limitations

Low availability of sufficiently large samples of low-motion fMRI data from a multisite source like ABIDE presents unavoidable constraints. Lack of an all-female comparison group was an unavoidable limitation. As further acknowledged in Sect. 1, only two prominent factors of heterogeneity in ASD samples were manipulated in our study. For example, age-related heterogeneity likely impacts fMRI findings of group differences [50]. In the present study, we matched samples on age keeping heterogeneity in age similar across all comparison groups (without being able to eliminate age-related heterogeneity). We used ADOS Total scores as indices of autism symptomatology, as this was the sole relevant measure available across ABIDE sites. ASD symptomatology does not straightforwardly relate to overall cognitive ability or other variables, such as executive functioning. Although we tried to match groups on IQ, the effect of heterogeneity on diverse cognitive and behavioral variables remained beyond the scope of our study, as such collateral data are limited and reported inconsistently across different sites contributing to ABIDE.

5 Conclusions

We examined the impact of sample heterogeneity on diagnostic classification of ASD. Overall, reduced heterogeneity with respect to gender and range of symptom severity was associated with improved performance of RF classifiers. Greater homogeneity of samples impacted both classification accuracy and selection of most informative features. Differences in sample heterogeneity may account for often conflicting findings in the neuroimaging literature on ASDs. Stratification by gender, symptom severity, age, cognitive ability, and other factors of variability may be critical in future efforts to pinpoint atypical brain features of ASDs.