1 Introduction

Autism spectrum disorders (ASD) are a group of highly heterogeneous disorders characterized by repetitive behaviors, impaired social interactions and a wide spectrum of neurodevelopmental and physical comorbidities. While the overall prevalence of ASD is estimated at 1 in 68 children [1], this may not represent an increase in incidence as much as it represents a widening of the scope of disorders that fit under the ASD umbrella in an era of improving clinical ascertainment. As a spectrum disorder, ASD may present as an isolated set of symptoms or with multiple comorbidities, including but not limited to intellectual disability, developmental delay, epilepsy, gastrointestinal complications, cardiac problems, immune disorders, etc. [2]. This heterogeneity is apparent even in the settings of identical genetic backgrounds (e.g., monozygotic twins discordant for co-morbidities), underscoring the complexity of understanding ASD on the molecular level.

While ASD disproportionately affects males (male: female ratio of 3.4:1), the reasons for this remain poorly understood. In fact, the search for a molecular etiology is further complicated by the interplay of both genetic and environmental factors which together contribute to pathogenesis. Importantly, high incidence despite the significant impairment of reproductive fitness means that the cause of ASD is likely different among most cases of ASD, i.e., unrelated patients will rarely share the same mutation or even the same gene. Despite that, it is clear that ASD has a major heritable component, with siblings of ASD patients usually having a one in five risk (ten-fold higher than population average) of developing ASD themselves [3]. Further, concordance between monozygotic twins ranges from 30% to 99%, and the overall heritability is estimated between 0.5 and 0.8 [4,5,6,7,8].

This complex landscape has made the discovery of genetic factors using traditional methods very difficult in the general population, mainly because studies with small sample sizes were underpowered to detect causal variants. As expected from the limitations of earlier technologies, the majority of loci identified were chromosomal abnormalities, with few individual genes identified. The subsequent introduction of high-throughput microarrays enabled the investigation of smaller chromosomal abnormalities termed copy number variations (CNVs), and the study of associations between common variants and the trait of interest. Signals detected from these three approaches (linkage, karyotyping and microarrays) rarely produced single candidate genes, usually narrowing the search space only to regions spanning kilobases to megabases, within which the search for causative genes was iterative and time-consuming. Alternatively, some mutations could be found by taking genes known to cause similar phenotypes in model organisms and resequencing them in a larger patient cohort.

Subsequent technological improvements led to the advent of next-generation sequencing (NGS), which has transformed the field profoundly, allowing the discovery of different classes of variations (e.g., single nucleotide variants (SNVs) and insertions/deletions (indels)) genome-wide. This enabled making discoveries from nuclear families in the absence of multiple affected or large pedigrees to establish linkage. The proliferation and accessibility of NGS technologies mean the bottleneck is no longer the ability to detect variants in a cost-effective manner, but the ability to amass cohorts that are large enough to capture a significant proportion of the genetic and phenotypic heterogeneity underlying ASD in the general population.

This chapter summarizes gene discovery in ASD in the pre- and post-sequencing era, explaining the significance of these discoveries, and framing them in the larger context of the potential impact that genomics will have on ASD diagnosis and care in the future.

2 Pre-NGS Era

2.1 Introduction

Next-generation sequencing (NGS) refers to advancements in technology that have enabled large-scale sequencing of many DNA fragments at the same time. These advancements have allowed the interrogation of variation at many loci in the genome in parallel at reasonable speed and cost, thus increasing the efficiency of genetic research. While NGS technologies began to appear in academic environments over a decade ago [9], their proliferation and adoption into the mainstream came only years later. Importantly, while several different technologies initially competed for adoption, it was Illumina’s short-read sequencing technology that captured the biggest market segment through a combination of price point, accuracy and speed. And while today the price of a single human genome is around $1000, the price was significantly higher until just a few years ago, rendering large-scale studies still very costly. This section covers discoveries made in autism genetics prior to the introduction of NGS to study this condition, whereas the next section covers discoveries made when large-scale genomic assessment became increasingly affordable, in what is known as the post-“genomic” or post-NGS era.

2.2 Linkage Studies

Due to the paucity of multiplex or extended pedigrees with ASD, linkage analysis was not a robust approach to gene discovery in ASD. Nevertheless, numerous studies were performed (reviewed in [10]), revealing few loci in total. Notably, of these, only two were ever replicated successfully in an independent study. These include linkage to chromosome 7q35, containing the CNTNAP2 gene [11], and to chromosome 20p13, containing four genes [12].

2.3 Association Studies

The development of cheap, high-throughput microarray genotyping technologies with higher marker density empowered a flurry of genome-wide association studies (GWAS) in a wide variety of human diseases. The discovery of markers by GWAS has two main limitations. First, the markers typically indicate loci with small effect sizes on the trait, sometimes increasing the odds ratio by as little as 0.05 [13]. Second, GWAS require very large sample sizes to have sufficient power to discriminate allele frequencies between cases and controls. For ASD, cohorts large enough for sufficient power could not be amassed, and therefore while GWAS were attempted over the past 10 years, few findings were ever replicated [12,13,14,15,16,17]. This is in contrast to studies of other neurodevelopmental conditions such as schizophrenia, for which cohorts in the tens of thousands could be amassed to discover tens of loci that replicate in independent cohorts. For ASD, only two loci have been implicated using GWAS to date: a locus on 5p14.1 (containing the CDH9 and CDH10 genes), and another on 20p12.1 (MACROD2 gene) [15, 16]. Importantly, consistent with the genetic heterogeneity and the need for very large numbers, neither of these loci has been replicated.
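The sample-size limitation can be made concrete with a rough power calculation. The sketch below is illustrative only: it assumes a simple allelic (two-proportion) test with a normal approximation, a genome-wide significance threshold of 5e-8, and a hypothetical control allele frequency of 0.3. None of these values come from the studies cited above, and the function names are invented for this example.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(p_control, odds_ratio):
    """Allele frequency in cases implied by a control frequency and an allelic odds ratio."""
    odds = odds_ratio * p_control / (1 - p_control)
    return odds / (1 + odds)


def allelic_test_power(n_cases, n_controls, p_control, odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided allelic test (normal approximation).

    Each individual contributes two alleles; the negligible opposite tail is ignored.
    """
    p_case = case_allele_freq(p_control, odds_ratio)
    se = sqrt(p_case * (1 - p_case) / (2 * n_cases)
              + p_control * (1 - p_control) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p_case - p_control) / se
    return NormalDist().cdf(z_effect - z_crit)


# Hypothetical scenario: common allele (frequency 0.3), odds ratio 1.05.
for n in (1_000, 10_000, 100_000):
    power = allelic_test_power(n, n, p_control=0.3, odds_ratio=1.05)
    print(f"{n:>7} cases and controls: power ~ {power:.3g}")
```

Under these assumptions, power is essentially zero with a thousand cases and only becomes appreciable with cohorts in the tens to hundreds of thousands, which is consistent with the difficulty described above.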

2.4 Chromosomal Abnormalities Studies

The association of ASD with other syndromic comorbidities such as Fragile X and intellectual disability was a first indicator that chromosomal-level events could underlie a subset of the condition. The concurrent evolution of microarray technologies introduced the ability to rapidly detect structural copy number variations in human genomes at scale. Together, karyotyping and CNV analysis have uncovered tens of chromosomal segments involved in ASD, including duplication of 15q [18], deletion of 22q11.2 [19], deletion of 16p11.2 [20] and deletion of Xp22.3 [21]. In addition, recurrent hotspots of de novo CNVs associated with ASD include duplications of 7q11.2 and deletions of 16p11.2, the latter also associated with schizophrenia [22, 23].

A key feature of CNVs is that they range widely in size from single-gene deletions to large regions encompassing tens to hundreds of genes. Consistent with multi-genic contribution to other phenotypes, patients with multiple de novo CNVs or large chromosomal abnormalities usually have more severe, syndromic phenotypes [24, 25].

As cohort sizes grow, it has also been shown that CNV-affected genes predominantly comprise candidates from three key pathways: neuronal signaling, synaptic function, and chromatin remodeling [26, 27]. Together, these studies not only identify novel loci, but demonstrate that de novo CNVs are strongly associated with ASD [28] and that recurrent CNVs point to shared architecture with other diseases.

2.5 Candidate Gene Resequencing Studies

In contrast to the paucity of discovery from linkage studies, work on other syndromes and CNV studies identified several candidate human ASD genes that could be screened by resequencing in larger cohorts. By assessing larger cohorts for mutations in these genes, the following genes were all found to harbor damaging point mutations in ASD subjects: MECP2 (Rett syndrome), TSC1 and TSC2 (tuberous sclerosis), CACNA1C (Timothy syndrome), NLGN3 and NLGN4 (X-linked mental retardation), CNTNAP2 (7q35 deletion), and SLC9A9 and BCKDK (epilepsy), among others [29,30,31]. Other genes discovered by resequencing to carry rare damaging variants include SHANK1, SHANK2, SHANK3, NRXN1 and NRXN3 [32,33,34,35,36].

2.6 Conclusion

In conclusion, the pre-NGS era relied mainly on candidate gene resequencing and association studies to link genes and loci to autism. Unlike in monogenic disorders, linkage analysis was not a very successful approach to finding genes linked to autism, primarily because it requires large pedigrees or multiple kindreds segregating the same locus, which are difficult to find given the genetic heterogeneity underlying autism and its detrimental effect on reproductive fitness.

3 The NGS Era

3.1 Next-Generation Sequencing as a Tool to Study Genetic Disease

Over the past decade, there have been numerous tools developed for next-generation sequencing (NGS) (Table 1). At its core, NGS may be broadly classified into two categories: whole genome sequencing (WGS), and targeted NGS. While the former is concerned with reading the entire content of an organism’s genetic material, targeted NGS methods focus on selectively sequencing a group of genes (“gene panels”), usually selected based on specific selection criteria, e.g., having been identified in smaller cohorts or in animal studies, or genes within the same pathway(s) as well-established candidate disease genes. These gene panels may be customized to include any number of genomic fragments of interest, including, for example, all coding regions—commonly known as whole exome sequencing (WES). Typical WES experiments also capture flanking regulatory regions, enabling discovery of variants affecting splice junctions and untranslated promoter and downstream sequences [37].

Table 1 Comparison of different sequencing technologies

Whole genome sequencing (WGS), on the other hand, covers both WES regions as well as non-coding and inter-genic regions. It is usually faster and more uniform because it does not require target panel capture, and thus can be performed with minimal sample preparation, resulting in sequences that are evenly distributed across all chromosomes. This distribution of sequencing coverage means that variants can be confidently assigned at an average depth of sequencing as low as 20X. Conversely, whole-exome and other panel sequencing requires target enrichment and PCR amplification, often resulting in highly variable coverage profiles with some regions (e.g., repetitive elements or GC-rich content) being missed due to technical limitations. Another important advantage of WGS’s even coverage is the ability to discover genome-wide structural variants (including copy number variants). Given the number of human disorders (including ASD) in which structural variants play a significant role, a single test that can assess both large and small genomic variation is often cited as a reason to use WGS despite its slightly higher cost vis-à-vis using a combination of microarray and WES for each patient.
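To illustrate why even coverage matters, the sketch below computes the average depth from read count, read length and target size, and then estimates the fraction of sites that fall below a minimum depth under an idealized Poisson model. The read count, read length, genome size (rounded to 3 billion bases) and the depth cutoff of 10 are hypothetical values chosen for illustration. Real libraries are more variable than a Poisson model predicts, particularly after target capture, so these figures should be read as a best case.

```python
from math import exp


def mean_depth(n_reads, read_length, target_size):
    """Average sequencing depth: total sequenced bases divided by target size."""
    return n_reads * read_length / target_size


def prob_depth_below(mean, min_depth):
    """P(depth < min_depth) at a site, assuming depth ~ Poisson(mean)."""
    term = exp(-mean)  # P(depth == 0)
    total = 0.0
    for k in range(min_depth):
        total += term
        term *= mean / (k + 1)  # advance from P(depth == k) to P(depth == k + 1)
    return total


# Hypothetical run: 400 million reads of 150 bases over a ~3 Gb genome.
depth = mean_depth(400_000_000, 150, 3_000_000_000)
print(f"mean depth: {depth:.0f}X")
print(f"fraction of sites below 10X: {prob_depth_below(depth, 10):.4f}")
```

In this idealized setting, only a small fraction of sites falls below the cutoff at an average of 20X, whereas the uneven profiles typical of capture-based methods would leave many more sites under-covered at the same average depth.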

3.2 Bioinformatics and Variant Interpretation

One important aspect of the NGS approach is the generation of large quantities of data, often requiring sophisticated computational tools (bioinformatics) to interpret. Specifically, bioinformatics pipelines share three major steps in common, irrespective of the NGS technology used: read alignment to a reference genome, variant calling versus the reference, and variant interpretation to distinguish pathogenic from benign variation.

Genetic variants may belong to several different classes, including: single nucleotide variants (SNVs, including single nucleotide polymorphisms (SNPs)), multi-nucleotide variants (MNVs, including small insertions and deletions (indels)), and structural variations (SVs, including copy number variations (CNVs)). For all three variant classes, a number of statistical considerations need to be taken into account to sort out likely true positive variants from noise, including: depth of sequencing, sequencing quality, the number of times mutations are observed, and the likelihood that such a change is true rather than an artifact of sequencing [38]. Importantly, the joint steps of read alignment and variant calling may themselves introduce error into the experiment, e.g., for fragments coming from highly repetitive genomic segments [39, 40].
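The statistical considerations listed above are often applied as simple hard filters before any interpretation is attempted. The following is a minimal sketch of that idea, not a description of any particular variant caller. The `Call` record, the `passes_filters` function and all thresholds are hypothetical and would need to be tuned per assay.

```python
from dataclasses import dataclass


@dataclass
class Call:
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int      # total reads covering the site
    alt_reads: int  # reads supporting the alternate allele
    qual: float     # Phred-scaled call quality


def passes_filters(call, min_depth=10, min_alt_reads=3,
                   min_alt_fraction=0.2, min_qual=30.0):
    """Keep calls with adequate depth, allele support and quality (illustrative thresholds)."""
    if call.depth < min_depth:
        return False
    if call.alt_reads < min_alt_reads:
        return False
    if call.alt_reads / call.depth < min_alt_fraction:
        return False
    return call.qual >= min_qual


# Toy calls, invented for this example.
calls = [
    Call("chr1", 1000, "A", "G", depth=42, alt_reads=20, qual=60.0),
    Call("chr1", 2000, "C", "T", depth=6, alt_reads=3, qual=45.0),    # too shallow
    Call("chr2", 3000, "G", "GA", depth=50, alt_reads=4, qual=35.0),  # low allele fraction
]
kept = [c for c in calls if passes_filters(c)]
print([(c.chrom, c.pos) for c in kept])
```

Fixed thresholds like these are easy to audit but trade sensitivity for specificity. Production pipelines commonly replace or supplement them with statistical models that weigh the same quantities jointly.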

The most challenging aspect of a bioinformatics pipeline is variant interpretation, the step where tens to hundreds of variants may require in-depth manual scrutiny to determine their putative effect on disease. Robust variant interpretation requires a well-annotated reference genome (for both coding and non-coding elements) and a large number of control individuals to accurately discriminate putative disease-causing variants from population-specific polymorphisms (which may rarely appear in public databases because inadequate numbers of population-matched controls are available) [41,42,43,44,45].

3.3 NGS Suitability in Routine Clinical Care

As NGS technologies become more widely adopted in academic hospital settings, there is a growing need to establish gold-standard pipelines to allow genomics to enter routine clinical testing [46, 47]. While some guidelines do exist, especially for diagnostic laboratory settings, these guidelines vary widely, and findings currently still require orthogonal validation before they are deemed actionable [46, 48]. The role of a clinical-grade pipeline is primarily to demonstrate processing and interpretation in a highly reproducible manner, thus ensuring disease management is not compromised by this approach [47]. These steps, however, are non-trivial: they would need to account for influences on data quality and sources of error, for example, sample preparation protocols, sequencing instruments and batch effects, whether all genes in a panel are adequately captured, errors in sequencing chemistry, and noise from the sequence alignment and variant calling steps.

Further, these tasks scale in complexity with the number of samples being studied and the databases from which annotations are being drawn. Of key consideration, for example, is the large number of variant sites produced per NGS run (three to four million per genome). Amongst these, hundreds or thousands of variants would be considered variants of unknown significance (VUS) whose interpretation and relevance to health and disease is completely unknown [46, 49]. In many cases, the recruitment of parents and siblings could help with sorting through these variants, but still tens to hundreds remain “private” variants with unknown function. For NGS to be adopted in routine care, clinical platforms must deal with such cases systematically, bearing in mind not to discard these variants because they may have future value as the genome is better annotated in the academic literature. Moreover, clinical platforms should take into consideration the constantly evolving annotations of genes, e.g., >200 new genes and hundreds of variants are being linked to diseases each year [50,51,52,53], and thus variant sharing as part of consortia may mitigate the absence of variants in the publication record. Such considerations need to be taken into account when designing clinical NGS pipelines, to ensure that genetic testing of patients is accurate, reproducible and safe. Only by controlling for these factors in a statistically robust framework would it be possible to ensure reproducibility and standardization, thereby enabling precision in data interpretation in disease settings.

4 Successful Application of NGS to Autism Spectrum Disorder

4.1 Sample Size and Cohort Considerations

The evolution of NGS thus enables the assessment of single families and single cases at a rate not performed before. The biggest challenge lies in discriminating rare alleles from population-specific polymorphisms, a challenge that can only be adequately addressed by sequencing a large enough number of both patients and of ethnic/population-matched controls. This is especially important in the setting of high genetic heterogeneity, where it is unlikely to find individuals sharing mutations in the same gene, let alone the same pathogenic variant. In recent studies, for example, sample sizes of >2000 families were required to identify recurrent gene and copy number regions shared between individuals [20, 23, 27, 54].

Conversely, in settings with high consanguinity, the approach of finding recessive variants is boosted by the usual availability of affected siblings or additional cousins (in multiplex families) who share the same homozygous mutations in candidate genes. However, the identification of recessive genes causing ASD has been limited so far by the type of families studied—mostly outbred simplex families with unrelated parents. In rare cases where families with multiple affected siblings were identified, they were found to have two different de novo causative variants rather than the same recessive variant [55]. This is expected due to the high levels of genetic heterogeneity underlying ASD.

However, this presents an important opportunity for consanguineous populations attempting ASD studies, with some initial reports showing promising results [56,57,58,59].

4.2 Exome and Genome Sequencing

Due to the paucity of studies in ASD families from areas of high consanguinity, recessive variants causing ASD have only been identified so far in the following genes: AMT, BCKDK, CNTNAP2, PEX7, SLC9A9, SYNE1, VPS13B, PAH and POMGNT1 [60]. In contrast to the few recessive genes discovered, the vast majority of families studied to date have been outbred, in which a single affected individual (simplex) is born to unaffected, unrelated parents. In these cases, the genetic architecture is usually driven by de novo mutations or rare inherited variants; even when multiple affected siblings are found in the same family, they are sometimes found to harbor separate de novo variants, stressing the importance of this type of variation in ASD etiology. There are approximately 800 genes affected by de novo variants in ASD (not counting genes within de novo chromosomal abnormalities) [10]. Altogether, the contribution of de novo mutations in ASD is estimated to be between 15% and 25% [61].
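The distinction between de novo and recessive inheritance drawn above can be sketched as a simple classification of trio genotypes. This is an illustrative toy rather than a published method. It assumes autosomal sites, genotypes encoded as alternate-allele counts (0, 1 or 2), and error-free calls in both parents. The function name and category labels are hypothetical.

```python
def classify_trio(child, mother, father):
    """Classify a variant from alternate-allele counts (0, 1 or 2) in a parent-child trio.

    Assumes an autosomal site and accurate genotypes in all three individuals.
    """
    if child == 0:
        return "absent"
    if mother == 0 and father == 0:
        return "de_novo"  # present in the child, absent from both parents
    if child == 2 and mother == 1 and father == 1:
        return "homozygous_recessive_candidate"  # one copy from each carrier parent
    return "inherited"


# Toy genotypes (child, mother, father), invented for this example.
examples = {
    "variant_1": (1, 0, 0),
    "variant_2": (2, 1, 1),
    "variant_3": (1, 1, 0),
}
for name, genotypes in examples.items():
    print(name, classify_trio(*genotypes))
```

In practice, an apparent de novo call often reflects a missed heterozygous genotype in a poorly covered parent, so candidate de novo variants are usually confirmed with an orthogonal method before being reported.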

In 2012, four groups published concurrent studies using exome sequencing to identify de novo gene-disrupting variants in ASD patients. Of the genes identified, only approximately 20 were recurrently hit across the cohorts, including: ADNP, ANK2, ARID1B, BCL11A, CACNA2D3, CHD8, CUL3, DSCAM, DYRK1A, GRIN2B, KDM5B, KDM6B, KMT2C, KMT2E, KMT5B, NCKAP1, PHF2, RIMS1, SCN2A, SYNGAP1, TBR1, TCF7L2, TNRC6B, and WAC [22, 54, 62,63,64,65,66,67]. The majority of the other genes identified were singletons (observed in a single patient without replication), but their potential role in ASD was supported by their impact on critical pathways in neurological development, such as cognition, synaptic formation, and regulation of transcription of brain-specific genes [54, 67, 68]. In addition to de novo variants affecting genes directly, more recent studies have found an enrichment of de novo and private disruptive mutations in DNase I hypersensitivity sites in regions close to some of the genes that have been implicated in ASD [69]. This indicates that ASD genes may be disrupted not only by mutations that alter function but also by those that alter gene regulation. Notably, one recurrent theme across most studies is that de novo point mutations are predominantly paternal in origin, with the rate of de novo mutations increasing with paternal age.

Despite advances in WES, fewer than 10% of patients with ASD receive a genetic diagnosis. This is far lower than the solve rate for neurodevelopmental disorders as a whole, in which diagnoses are reached in more than 30% of cases. Nevertheless, the utility of WES and WGS extends beyond simple diagnostic value, as it has allowed the identification of genes underlying more complex syndromes shared with ASD. For example, de novo mutations in the SWI-SNF-related gene ADNP cause a syndromic form of ASD with unique facial dysmorphism [70], whereas mutations in the NatA complex subunit NAA15 cause a syndromic form of ASD with multiple congenital anomalies including craniofacial, neuromuscular, and cardiac complications [71]. Such families might not have been recognized as sharing similar genetic underpinnings before the advent of NGS technologies, which now enable patients to be stratified more precisely based on their genetic abnormality rather than phenotypic variability.

Importantly, accurate genetic diagnosis is critical for determining potential therapeutic approaches for patients with ASD. One area where the impact has been most recognizable is in ASD related to branched-chain amino acid deficiencies, for example, branched-chain keto-acid dehydrogenase kinase deficiency, in which mutations in BCKDK were identified. These mutations cause loss of function of BCKDK, itself a repressor of branched-chain amino acid degradation, and therefore patients have a concurrent deficiency of BCAAs. In murine models, supplementation of knockout mice with BCAAs significantly improved their neurologic phenotypes, suggesting that patients with BCKDK mutations may benefit from dietary supplementation of BCAAs to counteract the elevated degradation caused by the genetic mutation [31]. Similarly, ASD patients with a wide variety of comorbidities (e.g., sleep disorders, seizures and metabolic and immune abnormalities) have been found to have imbalances in compounds that could easily be rectified by dietary intervention, such as folate, carnitine, cobalamin, etc. [72]. More recently, one case–control randomized trial has demonstrated that supplementation with essential fatty acids, carnitine, digestive enzymes, and a hypoallergenic diet (e.g., gluten, soy, and casein-free) all improved ASD symptoms, including non-verbal IQ and nutritional status [73]. Therefore, as more cohorts of patients continue to be evaluated at the genomic and epidemiological levels, the future of ASD research can lead to novel tools and therapies that improve stratification and clinical management of patients based on their genomic information, ushering in an era of personalized medicine for ASD.

5 Conclusion

Autism spectrum disorders (ASD) are a heterogeneous group of disorders characterized by clinical comorbidities and extreme genetic heterogeneity. While much has been achieved in understanding the molecular and genetic etiology, there is still a long way to go to understand how perturbations in genes ultimately lead to an ASD phenotype. Importantly, further studies may also reveal genetic markers of the development of different physical comorbidities, which can help in patient stratification and early intervention in cases predicted to become severe. Thus, as future studies are conceived, they ought to focus not only broadly on ASD patients across the entire spectrum, but also on important concepts such as data sharing and collaboration to aid in the interpretation, and eventually the treatment, of ASD across the globe.