Introduction

Intellectual disability (ID) or delayed psychomotor development are by far the most common reasons for referral to genetic services, and most severe forms are caused by single genetic defects. In Western populations where parental consanguinity is rare and families are usually small, most affected individuals are sporadic cases. As shown by array CGH [1] and, more recently, whole exome (WES) or whole genome sequencing (WGS) of patients and their parents [2, 3], dominant de novo copy number variants (CNVs) and mutations in single genes account for ID in the majority of these individuals [4, 5]. Autosomal recessive inheritance was rarely observed, at least in the first larger NGS-based trio studies of this kind [6, 7], where low sequencing depth may have hampered the identification of compound heterozygotes (discussed in ref. [5]). In a recent comprehensive study of children with severe developmental disorders, autosomal recessive defects accounted for 11.7% of all cases with a clear molecular diagnosis, whereas apparently disease-causing autosomal dominant de novo mutations were seen almost 5 times as often [8]. Not long ago, a meta-analysis of 2104 trios identified 10 novel genes for ID [9].

These observations, as well as the availability of conceptually simple ‘Trio sequencing’ strategies for their identification, explain why in recent years, genetic research into ID and related disorders has been dominated by the de novo mutation paradigm [2, 8, 10,11,12]. Most of the dominant de novo mutations identified inactivate or delete one copy of a haploinsufficient gene, whereas gain-of-function or dominant negative mutations seem to be much rarer. Since only a minority of the ~20,000 protein-coding human genes are dosage-sensitive [13], it is not surprising that according to recent estimates, there may be fewer than 500 genes in the human genome where functional loss of one copy is associated with ‘autism spectrum disorder’ (ASD) [14, 15], a collective term that includes ID and autism. ‘Trio sequencing’ of individuals with ID/ASD and their parents has already identified more than 400 of these genes [9, 16, 17].

In contrast, the molecular elucidation of recessive forms of ID is still in its infancy. By performing large-scale autozygosity mapping in unrelated consanguineous families, we were the first to show that autosomal recessive ID (ARID) is extremely heterogeneous [18, 19]. Extrapolation from the hitherto identified ~120 genes for recessive ID on the human X-chromosome [20] suggests that mutations in more than 3000 human genes may be associated with ARID [4]. To date, fewer than 600 of these genes have been identified [16, 17].

Only recently has ARID gained popularity as a promising target for research into the development and function of the human brain (e.g., see ref. [21]). Yet, the elucidation of ARID is also of considerable importance for global health care, a fact that is still widely disregarded. Recent population studies have shown that the prevalence of ID is highly correlated with the frequency and the degree of parental consanguinity. In the offspring of double cousin or uncle-niece unions, ID is 3 to 4 times more common than in children of unrelated parents [22,23,24,25], and there is compelling evidence for the involvement of recessive gene defects (e.g., see refs. [5, 25,26,27]). Over 1 billion people live in countries where consanguineous marriage is common [28], and it has been estimated that couples related as second cousins or closer and their offspring account for 10.4% of the global population [29].

In outbred Western populations, the search for recessive causes of ID has turned out to be tedious (e.g., see refs. [30, 31]). Even trio sequencing in several thousand affected families yielded only a modest number of novel ARID genes [8, 32]. In 2011, we performed large-scale high-throughput sequencing in cohorts of consanguineous families to speed up the search for novel ARID (candidate) genes [33] but it took several years before other groups joined in [34,35,36,37,38,39].

Here we report on WES and WGS studies in 404 consanguineous families with ID, about three times as many as analyzed 5 years ago by our group [33]. The vast majority of these are from Iran, where almost 40% of all children have related parents [40,41,42,43]. These investigations shed more light on the genetic causes of ARID. Moreover, they significantly broadened the basis for the diagnosis and prevention of cognitive disorders in Iran and beyond.

Materials and methods

This study comprised 404 out of 757 families with ≥ 2 affected individuals recruited during the past 10 years. Initial screening revealed Fragile X syndrome in 46 (6.1%) of these families which were excluded. In other families, different monogenic causes of ID have been identified prior to the NGS era (reviewed by refs. [5] and [44]), and many families could not be reached or were excluded because of poor parental cooperation, borderline ID or genetic heterogeneity.

From each family of our cohort, one affected patient was selected for sequencing.

Genomic DNA (gDNA) was extracted from peripheral blood, its quality was checked by Nanodrop2000 (Thermo Scientific), and approximately 2 µg gDNA was used for constructing deep sequencing libraries. In the course of this study, four different deep sequencing protocols were employed, i.e., target enrichment sequencing (TES) using the Illumina GAII sequencer, whole exome sequencing (WES) on the Illumina HiSeq2000 sequencer, and whole genome sequencing (WGS) on the Illumina HiSeq X Ten sequencer (Macrogen) followed by sequence alignment and variant calling using the DRAGEN infrastructure and pipeline (www.edicogenome.com). For some of the families, WGS was (also) performed by Complete Genomics, following the service provider’s standard procedure, using short-read paired sequencing and the CGA™ tools for data analysis (www.completegenomics.com/). Details about the performance of these protocols, including the average coverage depth (non-redundant reads only) per sample and the percentage of the targeted coding regions covered by 10, 20, or 30 reads, are shown in Table S6 and discussed below.

We analyzed TES and WES data by using our previously published Medical Resequencing Analysis Pipeline (MERAP) [45] (for details, see Materials and Methods Supplement; MERAP procedure). A particular strength of this pipeline is the detection of small, medium-sized, and large copy number variants. Therefore, and because homozygous CNVs turned out to be very rare in the families studied, array CGH was soon discontinued.

All potentially ID-causing variants detected by NGS were validated by Sanger sequencing, and for those identified by WES and WGS, co-segregation studies were performed, including all available and informative family members. A novel algorithm was developed to detect runs of homozygous markers (ROHs) encompassing possibly disease-causing mutations in the exome and genome of ARID patients, and comparison of ROHs spanning identical mutations in apparently unrelated families revealed shared haplotypes for all of them (for details, see Fig. 2, Supplement and Table S7). To further enrich the candidate list for pathogenic mutations, variants were also filtered in several other ways (for details, see Table S1), including pathogenicity prediction for missense variants using four established prediction tools, and selected for absence or very low allele frequencies in the ExAC database and our own in-house cohort, which largely consists of Iranian families. Moreover, we have shown that novel ID genes are functionally related to known ID genes, either through protein-protein or through regulatory interactions (for details, see Materials and Methods, Supplement). Variants in known genes were scored using the ACMG variant interpretation guidelines [46]. Of these, 83.3% were classified as pathogenic or likely pathogenic. A few variants of uncertain significance (VUS) were retained as likely relevant based on a variety of criteria (e.g., low allele frequency or not even listed in ExAC; confined to a single family in our in-house database; called as pathogenic by at least 2 out of 4 relevant algorithms [PolyPhen2, SIFT, MutationTaster, CADD]; distinctive clinical phenotype; location within ROHs; supporting evidence from functional studies and expression data). The same criteria were also employed to classify mutations in novel candidate genes. Nevertheless, rigorous confirmation of the pathogenicity of these variants will have to include future functional studies and/or the identification of identical variants in other affected families.
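The retention rule for variants of uncertain significance can be illustrated with a minimal sketch. This is not the MERAP pipeline or the code used in this study. The `Variant` fields, the allele-frequency cutoff of 0.001, and the decision to require all three conditions at once are assumptions made purely for illustration; only the rule of at least 2 out of 4 prediction tools and the names of the tools are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical cutoff: the text only says "low allele frequency".
MAX_ALLELE_FREQ = 0.001
# From the text: called pathogenic by at least 2 out of 4 algorithms.
MIN_DAMAGING_CALLS = 2
PREDICTORS = ("PolyPhen2", "SIFT", "MutationTaster", "CADD")


@dataclass
class Variant:
    """Illustrative record; the field names are assumptions."""
    gene: str
    exac_freq: Optional[float]  # None means "not listed in ExAC"
    in_roh: bool                # variant lies within a run of homozygosity
    damaging_calls: Dict[str, bool] = field(default_factory=dict)


def count_damaging(v: Variant) -> int:
    """Number of the four predictors that call the variant damaging."""
    return sum(1 for tool in PREDICTORS if v.damaging_calls.get(tool, False))


def retain_vus(v: Variant) -> bool:
    """Keep a VUS if it is rare, lies in an ROH, and >= 2 tools agree."""
    rare = v.exac_freq is None or v.exac_freq <= MAX_ALLELE_FREQ
    return rare and v.in_roh and count_damaging(v) >= MIN_DAMAGING_CALLS


# Placeholder variant in a made-up gene, for demonstration only.
example = Variant("GENE_X", None, True, {"SIFT": True, "CADD": True})
print(retain_vus(example))
```

In practice the clinical phenotype, segregation data and functional evidence listed above would be weighed by an expert rather than combined in a single Boolean rule.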

Fig. 1

Protein-protein interaction network linking 39 known (green nodes) and 48 novel genes (orange nodes) for recessive forms of ID that were identified during this study. Five genes have also been identified by other studies during the course of this work (light yellow nodes). Interactions were retrieved from the ConsensusPathDB resource (Kamburov A, Stelzl U, Lehrach H, Herwig R. Nucleic Acids Res. 2013). Known ARID genes refer to recent publications [17, 33,34,35] and are labeled according to the number of supporting references (dark green = high number of references). Enriched protein complexes and pathways are shown as colored clouds.

To validate previously reported ARID candidate genes [33], we have also generated fly models and performed behavioral tests [47, 48]. Other previously identified ARID candidate genes could be confirmed by identifying allelic mutations in unrelated families (for details, see Materials and Methods Supplement; Fly Models, Drosophila behavior testing and Table S3).

Results

In 219 out of 404 families investigated (54.2%) we identified likely disease-causing DNA variants in novel candidate genes and in known genes, all co-segregating with ID. As expected for affected offspring of healthy consanguineous parents, the vast majority of these turned out to be autozygous for autosomal recessive defects. Compound heterozygosity was confined to a single family (M135, see Table S1) with a frameshift and a missense change in the MADD gene, which has a role in synaptic vesicle transport.

Likely disease-relevant variants in known or novel X-chromosomal (candidate) genes were found in 26 (23 genes) out of 219 consanguineous families (11.9%). For all of these, inheritance patterns were compatible with X-linkage. Pedigrees of all families with mutations in novel candidate genes are shown in Fig. S1. Five of the novel candidate genes had not been implicated in ID before (see Table S1), two variants were identified in known non-ID disease genes (DIAPH2 and XPNPEP2), and two (KIF4A [49] and WDR13 [50]) had been linked to ID in a single family and are confirmed as X-linked intellectual disability (XLID) genes by this study. These data suggest that including Fragile X syndrome, X-linked gene defects may account for almost 18% of the consanguineous families with ID in Iran (see also refs. [51, 52]). Pathogenic mutations in DDX3X and PHF8 were observed in two affected brother pairs (families M030 and M9100013, respectively, see Table S1) but not in blood of their mothers, suggesting maternal germ cell mosaicism. A similar frequency of de novo mutations shared by siblings had been reported before [14].

For 26 autosomal and X-chromosomal genes we found allelic mutations in two or more families (Table 1). Most of these are known ID genes, and several had already been described in our previous study [33]. Four genes (IPP, ITGAV, RNFT2, and TTC5) had not yet been linked to any disease and PIDD1 only very recently [39]. For two known disease genes (AK1 and ALS2), an association with ID had not been reported before.

Table 1 ID and non-ID disease genes mutated in two or more consanguineous families

Two hundred and fifteen different, likely disease-causing variants were identified, 127 in known and 88 in novel (candidate) genes (Table 2). Of note, 11 out of 127 likely causative variants were found in genes that had been previously implicated in diseases other than ID. Of the 127 variants observed in known genes, 57 (44.9%) were loss of function (LOF) mutations including large deletions, stop-gain, frameshift, extension, and splice site variants, while 70 (55.1%) were missense variants. At 42%, the proportion of LOF variants was slightly lower in families with novel (candidate) genes, and the proportion of missense mutations or small in-frame deletions was slightly higher (58%). While these differences are not statistically significant, they might indicate that our criteria for selecting missense mutations in novel genes were slightly too permissive. On the other hand, it is noteworthy that in a recent study of de novo mutations causing ‘autism spectrum disorder’ (ASD) [14], the inferred relative contribution of gene-disrupting (43%) and missense mutations (57%) was very similar to our findings.
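The comparison of LOF proportions between known and novel genes can be reproduced approximately with a two-sided Fisher exact test. The sketch below is an illustration, not the test actually used in this study, which is not specified. The counts for known genes (57 LOF, 70 missense) are taken from the text; the counts for novel genes (37 LOF, 51 missense or in-frame) are an assumption, reconstructed from the rounded percentages of 42% and 58% of 88 variants.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Known genes: 57 LOF, 70 missense (from the text).
# Novel genes: 37 LOF, 51 other (ASSUMED from 42% / 58% of 88).
p_value = fisher_exact_two_sided(57, 70, 37, 51)
print(f"two-sided p = {p_value:.3f}")
```

With these counts the p-value is far above 0.05, in line with the statement that the difference is not statistically significant.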

Table 2 Likely causative variants observed in 219 identified families

There was almost no overlap between the genes implicated in ARID by our present study and 847 genes thought to be functionally redundant for which homozygous mutations have been recently identified in healthy adults with related parents [53]. Only 4 of the genes mutated in our families (CLN3, CLIP1, POMGNT1, and SASS6) were listed as potentially redundant, but there is solid evidence linking all four to recessive cognitive disorders. CLN3 (OMIM #204200) and POMGNT1 (OMIM #613151) are known genes for neuronal ceroid lipofuscinosis and congenital dystroglycanopathy with mental retardation, respectively, whereas for CLIP1 and SASS6, homozygous deleterious mutations have been identified in at least two unrelated ARID families (this study and ref. [54]).

Further, albeit indirect, support for the reliability of our findings was obtained from the confirmation of previous results. Of the 50 novel ARID candidate genes presented previously [33], > 30 have been firmly implicated in ARID through identification of additional families with allelic mutations, our investigation of fly models (Table S2 and Fig. S2) or in other ways (for a comprehensive overview, see Table S3). For many of the remaining candidate genes, mouse models with behavioral abnormalities have been reported (e.g., see http://www.informatics.jax.org). Moreover, the introduction of MERAP, a comprehensive Medical Resequencing Analysis Pipeline with its integrated Logit pathogenicity score [45], has greatly improved the identification and ranking of likely disease-causing sequence variants. Therefore, we believe that eventually, most of the candidate genes presented here will be confirmed. Indeed, this expectation has already been met for a variety of (former) candidate genes identified in the course of this study, including FMN2 [55], CLIP1 [56], CAPN10 [57], MFSD2A [58, 59], SLC6A17 [60], HNMT [61], DDX3X [62], TAF6 [34], and TAF1 [20, 63], which have been recently published by us and/or other groups.

ARID is often associated with microcephaly

In our previous study [33], novel forms of ARID had been considered as non-syndromic if index patients showed no obvious clinical symptoms other than ID. In many of these families, however, re-examination including affected siblings revealed additional clinical signs that had been overlooked before. In the present study, thorough clinical examination of all affected family members and comparison with families carrying allelic mutations allowed us to classify as syndromic 200 out of 219 families with a defined disease-causing variant. In 86 of the 219 ‘identified’ ARID families, the average occipito-frontal circumference (OFC) of affected individuals was at least 2 SD lower than the mean, and mutations in 30 novel candidate genes for ID were found to be associated with microcephaly (see Table S1, column G, and clinical description, Text File S1). Very severe microcephaly, with OFC < −7 standard deviations (SD) [64], was observed in families with homozygous, likely damaging mutations in the genes PPP1R35, GUF1, METTL5, PUS7, and TBC1D23 (see more detailed information in Supplementary Text).

ID with moderate microcephaly (OFC: −5 SD to −3 SD) was observed for 11 novel genes. For three of these, a second affected family with an allelic mutation has established their involvement in ID and microcephaly.

Defects in three genes, i.e., SP2, CLPTM1, and MADD, were found to be associated with enlarged head size. The most striking head enlargement (OFC: + 4 and + 5 SD, respectively) was observed in two children with mild to moderate ID (see family M135 in Supplementary Text S1) and compound heterozygosity for two allelic MADD mutations.

Of note, in two of three genes previously linked to non-ID disorders, PLIN1 (OFC: −3.5 SD) and YARS (OFC: −9.5 SD), we observed likely causative variants associated with microcephaly, and for an AK1 mutation, an association with enlarged head size has been observed.

Epilepsy is also common, but autism is rare

Epilepsy was observed in 62 families of our cohort (28%), involving 33 known and 29 novel candidate genes (see Table S1, column H, and Supplementary Text). Of note, three out of 33 known genes (ALS2, FDPS, and XPNPEP2) had not been linked to ID before. Thus, after microcephaly, epilepsy was the most common additional finding in families with ARID. As judged from MRI results, which were only available for a minority of the ARID families, structural brain abnormalities and/or leukoencephalopathy are also fairly common (Table S1).

Prominent signs of autism were only present in 8 of the 219 families, involving 5 known and 3 hitherto unknown ID genes. These findings corroborate our earlier observation that compared to sporadic forms of ID seen in outbred populations, autism is rare in patients with recessive forms of ID [33]. Among the relevant known ID genes, four genes (ADSL, SHANK3, GRM1, and CNTNAP2) have been implicated in autism before. Novel ID- and autism-associated (candidate) genes included SP2, TRIM47, and EZH1. The association of SP2 mutations with autism and large head size is noteworthy because it has been reported that autistic children tend to have large brains [65]. SP2 is a cell cycle regulator gene whose deletion leads to the interruption of neurogenesis in embryonic and postnatal brain [66]. TRIM47 is expressed in fetal astrocytes and may be involved in brain development [67]. The homozygous truncating mutation observed in the EZH1 gene is of particular interest. Heterozygous de novo mutations in the paralogous EZH2 gene are associated with Weaver syndrome [68], characterized by developmental delay, overgrowth and dysmorphic signs. Both EZH1 and EZH2 catalyze mono-methylation, di-methylation, and tri-methylation of histone H3 at lysine 27 (H3K27me1/2/3) [69], but EZH1 is less abundant in embryonic stem cells and has weaker methyltransferase activity. We show here that homozygous loss of EZH1 function is also associated with overgrowth (see clinical description of family M8800071 in Supplementary Text and Fig. S3), but otherwise there was little phenotypic overlap with Weaver syndrome. In another family with ID and autism, we found a likely disease-causing variant in the XPNPEP2 gene. XPNPEP2 may be involved in cleavage of neuropeptide Y [70], a neuromodulator implicated in controlling the energy balance and behavior.

Allelic mutations causing recessive or dominant ID

In six previously described genes for autosomal dominant ID (ADID) or related disorders (CACNA1C, SCN8A, SETBP1, SHANK3, ATP1A3, PRRT2), we have identified recessive sequence variants that co-segregated with ID in consanguineous ARID families (Table S1). Four of these are likely mild missense mutations, as evidenced by moderate Logit pathogenicity scores (see Table S1). This may explain why heterozygous carriers in the respective families are healthy and only homozygotes have ID, as shown for a Pro→His mutation in CACNA1C, an Arg→His mutation in SCN8A, a Glu→Gly mutation in SETBP1 and a Val→Ala mutation in SHANK3 (Table S1). Dominant ATP1A3 mutations (see OMIM *182350) have been identified in dystonia, alternating hemiplegia of childhood and in the severe CAPOS syndrome. The Arg476Cys variant found in family M204 has high pathogenicity scores, but has not been linked to disease before. It is listed 65 times in the ExAC database, but exclusively in heterozygotes. In family M204, the phenotype of two homozygous females born to healthy second-cousin parents overlapped with CAPOS syndrome, including severe ID, cerebellar ataxia with quadrupedal gait, short stature (< 3rd percentile) and in one, seizures since infancy. Reduced body size has also been noted in a mouse model for this disorder (see OMIM *182350).

For PRRT2 (see OMIM *614386), heterozygous loss-of-function (LOF) mutations have been described in families with three different dominant disorders, i.e., infantile convulsions with paroxysmal choreoathetosis, episodic dyskinesia and infantile benign familial seizures. A recurrent c.649dupC frameshift mutation in the PRRT2 gene has been identified as the most common cause of all three conditions, which have been observed in different members of the respective families [71]. In family M003 reported here, homozygosity of c.649dupC is associated with moderate ID, strabismus, seizures, and spasticity; the parents are healthy and do not have a history of infantile seizures. To our knowledge, homozygous truncating PRRT2 mutations with ID have not been described before.

For TBC1D23 and HIVEP3, presented here as novel ARID (candidate) genes (families M268 and M8700057 in Table S1), heterozygous de novo mutations have been described in individuals with autism and schizophrenia, respectively [17, 72].

Discussion

Many of the ~570 genes hitherto implicated in ARID [5, 16, 17, 44] code for metabolic enzymes, and their defects often cause severe or even lethal inborn errors of metabolism. In our study, we have focused on familial ID, and many affected individuals were recruited as adolescents or even adults. This may explain why in our cohort, the proportion of families with metabolic defects is much lower. Detailed information about the function of all novel ARID (candidate) genes is provided in Table S1 (see column U). Grouping them into functional classes is somewhat arbitrary because many genes have pleiotropic functions. As previously reported [33], ARID is often caused by mutations in genes with essential ‘housekeeping functions’ such as DNA transcription and translation, cell division, protein degradation, or energy metabolism. Here we show that the spectrum of recessive gene defects leading to ID is much wider. In particular, novel ARID genes are tightly connected with known ARID genes at the level of protein–protein interactions often functioning in fundamental biological processes such as TFIID, elongation and 7SK RNP complexes (Fig. 1). A detailed functional characterization of the novel ARID (candidate) genes is summarized in Tables S4.1–4.7 and different aspects are shown in Figs. S4–S6.

Fig. 2

Regional clustering of recurrent ARID mutations suggests short half-life and rapid replacement of serious recessive disease-causing gene defects in Iran

Apart from the functional classes mentioned above, defects involving synaptic function, cell migration, cell signaling and, remarkably, innate immunity form visible clusters (see Table S1, column T). In total, 7 of the novel ARID (candidate) genes presented here have a role in innate immunity, and at least three (HIVEP3, PIDD1, and TMED7-TICAM2) are upstream regulators of NF-kappa B (see Table S1). Two ARID genes, CC2D1A [73] and TRAPPC9 [74], have been linked to innate immunity before; inactivation of these genes leads to up-regulation of NF-kappa B signaling. Recently, excessive NF-kappa B signaling has also been implicated in the pathogenesis of Rett syndrome and presented as a possible new route for alleviating the course of this severe X-linked neurodevelopmental disorder [75]. Using homozygosity mapping in consanguineous families with autism-related phenotypes, Gamsiz et al. (2015) identified novel rare recessive loci, including a protein-truncating mutation in CC2D1A [76].

Functions of ARID and ADID genes

Dominant de novo mutations in fragile X mental retardation protein (FMRP)-interactors and chromatin-remodeling genes are common in sporadic forms of ID and autism, and the same is true for genes coding for post-synaptic density proteins (e.g., see refs. [14, 15, 72, 77]). In contrast, only two of the novel ARID genes identified by our present study qualify as chromatin-remodeling genes (SMYD5 and EZH1, with ATF7IP as possible third). Three of the novel ARID (candidate) genes (CTNNA2, FSCN1, and ITSN1), and a known disease gene reported with non-ID phenotype in OMIM (AK1) code for postsynaptic density proteins (http://www.genes2cognition.org/db/GeneList) [78]. ALS2, another known non-ID disease gene, is the only ARID gene among the top 40 FMRP targets that have been linked to ID or autism [79]. Most FMRP targets and genes implicated in sporadic forms of ID and autism code for exceptionally long, highly brain-expressed proteins [80], whereas ARID proteins tend to be shorter, and as shown here, they are less often involved in multi-protein complexes. These differences may explain why protein interaction and regulatory networks for recessive forms of ID show little overlap with published ones (e.g., see refs. [81, 82]; see also Fig. S7 and Tables S5.1–5.13).

Geographical clustering of recurrent mutations

The scarcity of compound heterozygosity in Iran may reflect the tradition to marry within families or large clans with ‘private’ recessive defects (e.g., see refs. [52, 83]). This is supported by regional clustering of apparently unrelated families with identical mutations. Six likely disease-causing variants were observed in two different families of this cohort. One of these (AP4M1 p.E193K) had already been described in our previous study (family M004 in ref. [33]), and a mutation in another, seventh gene (PRRT2 p.Arg217Profs*8, family M003, Table S1) had also been observed before (family M010 [33]). Most of the families carrying matching mutations turned out to be from the same or neighboring provinces or even the same town (see Fig. 2). Haplotype analyses confirmed their identity by descent, even for three families with the recurrent AP4M1 p.E193K mutation living in different regions of Iran, thereby ruling out the possibility of a mutational hotspot (see Table S7). To our knowledge, none of these recurrent mutations has been described outside Iran so far. On the other hand, the relatively small size of the shared haplotypes argues against the possibility that these mutations are evolutionarily young. While in consanguineous demes, genetic drift will rapidly lead to loss of internal diversity at a given locus, there is evidence that the overall gene diversity in the population as a whole will remain remarkably stable [84], which may explain these observations.

Spectrum of ARID genes in neighboring populations

Recently, several groups have looked for genetic defects causing neurodevelopmental disorders in neighboring countries where parental consanguinity is also common. We have compared the outcome of our previous [33] and the current investigations with combined data from Arab countries [34, 37, 38, 85], and with the results of studies conducted in Turkey [35] and Pakistan [36, 39]. Of the 228 known and novel ARID (candidate) genes carrying mutations in our (mainly Persian) cohort, only 28, 11, and 25, respectively, were found to be mutated in the cohorts from Arab countries, from Turkey and from Pakistan (Fig. 3, Table S8).

Fig. 3

Genes mutated in autosomal recessive ID and related neurodevelopmental disorders in predominantly Iranian (cohort A, 228 genes [5, 33, 44], and this study), Arab (cohort B, 252 genes [34, 37, 38, 85]), Turkish (cohort C, 67 genes [35]), and Pakistani families (cohort D, 100 genes [36, 39]) (see Table S8)

Of note, no single ARID gene was found to be mutated in all 4 cohorts, which corroborates the conclusion that in highly consanguineous populations, most severe recessive disease-causing mutations are confined to clans and do not spread much farther, and that in ARID, the locus heterogeneity is extremely high [19]. Thus, compared to the estimated 500 genes involved in ADID [14, 15] the number of ARID genes must be large and is likely to run into the thousands [4].
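The cohort comparison amounts to intersecting gene sets. The following minimal sketch shows the kind of computation involved; it is not the analysis code behind Fig. 3, and the gene names and cohort contents are toy placeholders, not the data of Table S8.

```python
from itertools import combinations


def pairwise_overlap(cohorts):
    """Return {(name1, name2): shared genes} for every pair of cohorts."""
    return {
        (a, b): cohorts[a] & cohorts[b]
        for a, b in combinations(sorted(cohorts), 2)
    }


def shared_by_all(cohorts):
    """Genes found in every cohort (expects at least one cohort)."""
    return set.intersection(*cohorts.values())


# Toy placeholder gene sets, NOT the real cohorts A to D.
toy = {
    "A": {"GENE1", "GENE2", "GENE3"},
    "B": {"GENE2", "GENE4"},
    "C": {"GENE3", "GENE5"},
    "D": {"GENE2", "GENE6"},
}

for pair, genes in pairwise_overlap(toy).items():
    print(pair, len(genes))
print("shared by all cohorts:", len(shared_by_all(toy)))
```

An empty intersection across all four sets corresponds to the observation that no single ARID gene was mutated in all cohorts.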

Marrying within families or clans should also favor the regional clustering of genetic risk factors for related complex diseases. Thus, genetic factors predisposing to other neuro-psychiatric disorders are not likely to spread in these countries either, and common associated markers are expected to be rare, not only in conditions such as autism and schizophrenia where GWAS cannot work due to reduced fecundity of affected individuals and rapid turnover of the predisposing genetic factors [21, 86].

The quest for ID genes: an unaccomplished mission

In view of the rapidly growing capacity for whole exome or whole genome sequencing, the detection of genetic variants is no longer a problem, but assessing the clinical relevance of genetic variants is still a bottleneck. Algorithms predicting the pathogenicity of missense variants in known disease-associated genes have improved, but their reliability is still limited. Gene- or pathway-specific functional tests have been employed to study mutations implicated in immunodeficiencies [87] and defects in the blood coagulation pathway [88], but devising analogous functional tests for neuropsychiatric disorders [21] is a much greater challenge. For ARID, this approach is not a realistic option given the plethora of functionally different ARID genes and the high proportion of families with likely causative mutations in novel genes, indicating that most ARID genes have not been identified yet (see also Fig. 3).

Therefore, the search for gene defects that cause or predispose to ID and/or related disorders has to remain a priority of research into neuropsychiatric disorders until most of the underlying gene defects are known. In line with previous considerations [89], our results suggest that there are more recessive than dominant forms of ID, and their overdue systematic elucidation will generate a wealth of new data on the development and function of the central nervous system. Searching for multiple allelic mutations in cohorts of consanguineous families with two or more affected children is the strategy of choice for identifying hitherto unknown ARID genes. The success of this approach is primarily dependent on the number of families studied, and at least in principle, it does not rely on functional clues which are often scant or absent. The identification of most or all single-gene defects causing ID and related neurodevelopmental disorders will be a major step towards understanding the function of the human brain in health and disease.

Where are the missing mutations?

We detected likely disease-causing variants in 219 out of 404 consanguineous ID families (54.2%). This finding was comparable to that of our previous, smaller study (57% [33]), but lower than in two other recent investigations [34, 35]. The lower mutation yield of our present study may reflect the inclusion of families which had been unsuccessfully screened by targeted exon sequencing before. Of note, 11 families harbored variants in genes known to be associated with diseases other than ID.

These studies and the unexpected paucity of compound heterozygosity highlight the importance of linkage information and suggest that combining autozygosity mapping with WES is a superior strategy for identifying disease-causing mutations in consanguineous families. Autozygous genomic segments harboring most of the recessive mutations can also be identified by sequencing several individuals per family, which is a time-saving, albeit more expensive alternative to prior linkage studies [35].
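The principle of restricting candidate variants to autozygous segments can be sketched as follows. This is a deliberately simplified illustration and not the ROH algorithm developed for this study: it assumes genotypes on one chromosome are already sorted by position, uses hypothetical thresholds, and does not tolerate isolated heterozygous calls caused by genotyping errors.

```python
def find_rohs(markers, min_markers=50, min_length_bp=1_000_000):
    """Find runs of consecutive homozygous markers on one chromosome.

    markers: iterable of (position_bp, is_homozygous), sorted by position.
    Returns a list of (start_bp, end_bp, n_markers).
    The default thresholds are illustrative assumptions; min_markers
    must be at least 1.
    """
    rohs, run = [], []
    # A trailing heterozygous sentinel closes the last open run.
    for pos, is_hom in list(markers) + [(None, False)]:
        if is_hom:
            run.append(pos)
            continue
        if len(run) >= min_markers and run[-1] - run[0] >= min_length_bp:
            rohs.append((run[0], run[-1], len(run)))
        run = []
    return rohs


def variant_in_roh(variant_pos, rohs):
    """True if the variant position lies inside any detected run."""
    return any(start <= variant_pos <= end for start, end, _ in rohs)


# Toy genotypes with small thresholds, for demonstration only.
toy_markers = [(100, True), (200, True), (300, True), (400, False), (500, True)]
toy_rohs = find_rohs(toy_markers, min_markers=3, min_length_bp=100)
print(toy_rohs, variant_in_roh(250, toy_rohs))
```

Homozygous candidate variants falling outside every run would be deprioritized in a consanguineous family, which is the filtering step that linkage information provides.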

Not unexpectedly, the additional diagnostic yield of WGS was limited, suggesting that the vast difference between WES and WGS reported by others may primarily reflect technical differences rather than indicating that WGS is fundamentally superior to WES. In principle, of course, WGS should allow the detection of all kinds of mutations everywhere in the genome, not only in coding regions and exon-flanking splice sites. In practice, however, this advantage is rather theoretical as long as we cannot reliably identify functionally relevant sequence variants in the non-coding portion of the genome, including deep intronic or even exonic mutations affecting splicing, but also enhancer, repressor or insulator mutations in intergenic regions. Recent studies suggest that exonic variants enhancing or silencing splicing [90] or generating novel splice sites [91] are important ‘sinks’ of disease-causing mutations, and novel algorithms have been developed that promise to facilitate their identification.

Other mechanisms that may cause ID in sporadic patients, such as dominant de novo mutations, polygenic inheritance or epigenetic changes that are not directly related to changes in the DNA sequence, are unlikely to cause ID in multiple children of healthy consanguineous parents, and recurrent parental germline mutations are also rare.

Many of the mutations missed by the aforementioned studies may involve non-coding regulatory sequences. Therefore, their identification and functional characterization are of central importance for both basic research and diagnosis. For defining regulatory sequences in the genome ([92] and references therein), it is advantageous to study defects with highly specific, recognizable phenotypes; thus, ID does not qualify. It is also unlikely that algorithms for assessing the functional relevance of non-coding sequence variants will be available soon. However, searching for regulatory mutations in autozygous genome intervals of large ARID families is a viable option, as previously documented for X-linked ID [93]. Large ARID families from our cohort in which even WGS failed to identify a likely causative mutation should be particularly suitable for this purpose.

The spectrum of ARID gene defects identified in predominantly Iranian, Arab, Turkish, and Pakistani families shows little overlap, as illustrated above. Two factors may account for this. First, in countries where parental consanguinity has been practiced for many generations, deleterious recessive mutations are expected to differ between demes or clans, and the gene defects present in the entire population are a more or less stochastic sample from the large pool of gene defects that can give rise to ARID. Second, in samples of limited size, the frequency of specific gene defects causing ARID will differ from their prevalence in the population. Given the relatively small cohorts of ARID families studied to date and the very high number of potential ARID genes, this sampling error is presumably large. Indeed, this is supported by the limited number of overlapping gene defects identified in two separate cohorts of families sampled from the Iranian population [33] (this study, see Fig. S8). Therefore, it may be possible to consider these cohorts and those from Arab countries, Turkey and Pakistan as independent samples from the same gene pool, and the combined outcome of these studies may provide a sufficiently broad basis for estimating the total number of ARID genes [94].
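To illustrate how the overlap between two independent samples can inform such an estimate, the following minimal sketch applies Chapman's bias-corrected version of the Lincoln-Petersen capture-recapture estimator to two sets of genes. This is an illustrative technique chosen here, not necessarily the approach taken in ref. [94]; the function name, the gene labels and the cohort sizes are hypothetical and are not data from this study. The estimator assumes that every gene is equally likely to be detected in either cohort. Because genes with frequent mutations are more likely to be found in both, the result tends to underestimate the true total.

```python
def chapman_estimate(genes_a, genes_b):
    """Estimate the size of the gene pool from two independent samples.

    Chapman's estimator: N = (n1 + 1) * (n2 + 1) / (m + 1) - 1,
    where n1 and n2 are the numbers of distinct genes found in each
    cohort and m is the number of genes found in both.
    """
    a, b = set(genes_a), set(genes_b)
    shared = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (shared + 1) - 1


# Hypothetical example (placeholder labels, not real results):
# cohort A implicates 60 genes, cohort B 70 genes, 10 are shared.
cohort_a = {f"GENE{i}" for i in range(0, 60)}
cohort_b = {f"GENE{i}" for i in range(50, 120)}

estimate = chapman_estimate(cohort_a, cohort_b)
print(f"Estimated number of genes in the pool: {estimate:.0f}")
```

The smaller the overlap relative to the cohort sizes, the larger the estimated pool, which is consistent with the argument that little overlap between cohorts points to a very high number of ARID genes.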

In conclusion, this study has identified numerous novel ARID genes, as well as likely ID-causing mutations in a large number of genes that had not been implicated in ARID before. It revealed that most forms of ARID are syndromic, with microcephaly being present in almost half of the families, while autism is rare; and that combining genomic sequencing with autozygosity mapping in consanguineous families is the strategy of choice for identifying novel ARID genes. Our study showed that the implementation of WES or WGS might be an efficient diagnostic strategy for countries where parental consanguinity is common and recessive disorders are a central problem of health care. In outbred Western populations, large consanguineous families are rare, and even the largest pilot studies may be too small to elucidate recessive disorders in a systematic fashion.