Introduction

Over the past two decades, much progress has been made in identifying the causal variants or mutations and candidate genes for Mendelian (single gene or monogenic) disorders through mainly traditional linkage studies (Botstein and Risch 2003). The terms ‘variant’ and ‘mutation’ have been used interchangeably throughout the literature; however, ‘variant’ will be used consistently throughout this article. Mendelian or monogenic disorders encompass ‘classical’ disorders such as Freeman–Sheldon syndrome (Ng et al. 2009), Fowler syndrome (Lalonde et al. 2010) and the monogenic form of complex diseases such as autosomal-dominant amyotrophic lateral sclerosis (Johnson et al. 2010b) and hypercholesterolemia (Rios et al. 2010). Currently, causal variants for approximately 3,000 Mendelian disorders have been identified (Online Mendelian Inheritance in Man, http://www.ncbi.nlm.nih.gov/omim).

Genome-wide linkage studies followed by positional cloning have been very successful in identifying causal variants for Mendelian disorders because of the perfect segregation pattern of the causal variant with the disorder according to Mendelian inheritance patterns (e.g. autosomal dominant, autosomal recessive and X-linked). This perfect segregation pattern is due to complete or almost-complete penetrance of the causal variant. In genome-wide linkage studies, no prior hypothesis is needed, as evenly distributed genetic markers (for example, several hundred microsatellites or several thousand single nucleotide polymorphisms, SNPs) are sufficient to cover the whole genome. There are only a limited number of recombination events within a family or pedigree. The genetic markers will reveal genomic regions that co-segregate with the disorder in affected individuals. This can then be followed up by positional cloning to identify the causal variants and candidate genes within the genomic regions, which can span up to tens of centimorgans (cM). In contrast, candidate-gene based linkage studies require a prior hypothesis and are not designed to reveal novel genomic regions for Mendelian disorders (Botstein and Risch 2003).

Classical linkage studies are the main tool for elucidating the genetics of Mendelian disorders; however, not all of these disorders are amenable to this study design. Homozygosity mapping, on the other hand, is a more powerful and effective approach to study recessive disorders in consanguineous families (Harville et al. 2010; Pang et al. 2010; Iseri et al. 2010; Collin et al. 2010). For those disorders that are not amenable to these two conventional approaches, the causal variants remain elusive. These disorders include (a) ‘extremely rare’ Mendelian disorders where only a small number of cases are available, (b) unrelated cases from different families and (c) sporadic cases due to de novo variants. For some Mendelian disorders, cases can occur sporadically through a de novo (new) variant that arises during meiosis and is undetected in the parents (Table 1). We use the term ‘extremely rare’ to distinguish those Mendelian disorders which cannot be investigated by linkage studies due to their low incidence in the population from ‘rare’ disorders where an adequate sample size can still be collected for linkage studies. For extremely rare disorders, usually only several affected siblings in one family or several unrelated cases from different families are available for investigation. However, exome (the collection of all exons in the human genome) sequencing now offers new opportunities to study extremely rare disorders and sporadic cases (Table 1) as well as complex diseases (Li et al. 2010b).

Table 1 Summaries of exome and whole-genome sequencing studies of Mendelian disorders

Two recent review papers on exome sequencing of Mendelian disorders focused on variant filtering strategies (Ng et al. 2010c) and novel genomic techniques (Kuhlenbäumer et al. 2011). However, we review this area in a broader context and focus on several topics which have not been comprehensively discussed previously. In this paper, we start by discussing the need for exome sequencing of Mendelian disorders and the technological developments leading to the feasibility of this approach. We also recall the importance and value of interrogating the genetics of Mendelian disorders which tend to have been given less emphasis in the era of genome-wide association studies (GWAS) and then further elaborate on the application of exome sequencing in elucidating the genetics of Mendelian disorders and the recent advances achieved in the field. The pros and cons of currently employed variant filtering strategies will also be discussed. We also examine the advantages and challenges of exome sequencing in identifying causal variants for Mendelian disorders. Finally, as most of the known causal variants were found in exons (protein coding regions), we share our views on whether whole-genome sequencing is needed for Mendelian disorder research.

Why exome sequencing is needed

The linkage study design is unsuitable for extremely rare Mendelian disorders because of the difficulty of collecting an adequate number of affected individuals (from multi-generational pedigrees) and families for a statistically powerful study. This approach is also not applicable to sporadic cases, for example Kabuki syndrome, an extremely rare autosomal-dominant Mendelian disorder with an estimated incidence of 1 in 32,000, where the majority of reported cases are sporadic (Ng et al. 2010a). As a result, the causal variant and candidate gene for Kabuki syndrome remained unknown until recently. A total of 33 different causal variants in MLL2 were identified by Ng et al. (2010a) in 35 of 53 individuals affected with Kabuki syndrome. Additionally, in 12 of these individuals whose parental samples were available, the variants in MLL2 were found to have occurred de novo. Only ten of the 53 individuals were investigated in the discovery study using exome sequencing to identify the causal variants in MLL2, and the exons of this gene were then screened in an additional 43 cases using Sanger sequencing (Ng et al. 2010a).

Similarly, most of the cases of Schinzel–Giedion syndrome have occurred sporadically, suggesting that heterozygous de novo variants may have caused the disorder. This has now been further supported by the identification of de novo causal variants in SETBP1 in four individuals affected with this disorder through exome sequencing (Hoischen et al. 2010). These de novo causal variants would not have been identified without exome sequencing. In contrast, although none of the causal variants in DHODH appeared to have occurred de novo for Miller syndrome, it is still an extremely rare disorder (Ng et al. 2010b). Therefore, these disorders are intractable to the linkage study design. Collectively, these studies have demonstrated the advantages of exome sequencing over the linkage study design in situations where only a small number of unrelated samples or sporadic cases are available. Up to ten samples have been previously interrogated by exome sequencing in discovery studies (Table 1).

Furthermore, the linkage study design is also not robust enough for Mendelian disorders with genetic heterogeneity (i.e. the causal variants are present in different genes) and phenotypic heterogeneity (i.e. diverse clinical or phenotypic manifestations leading to uncertainty in diagnosis of the disorder or ambiguity in phenotype). These problems are well illustrated by Kabuki syndrome, which is likely a genetically heterogeneous disorder because not all the affected individuals have causal variants in the single candidate gene (MLL2) (Ng et al. 2010a; Paulussen et al. 2010). Nevertheless, causal variants in different genes have not yet been found to further support its genetic heterogeneity. Exome sequencing is more robust for disorders with a presumed background of genetic heterogeneity. Kabuki syndrome is also characterized by phenotypic heterogeneity. To account for this, investigators have performed additional phenotypic stratification and ranking steps (Ng et al. 2010a). Initially, this study failed to identify a compelling candidate gene harboring causal variants in all ten investigated individuals. However, by accounting for the genetic and phenotypic heterogeneity, the investigators successfully identified causal variants in MLL2 in a subset of individuals. This illustrates the additional challenges present in studying disorders with genetic or phenotypic heterogeneity. Other Mendelian disorders summarized in Table 1 also demonstrate varying degrees of genetic or phenotypic heterogeneity.

High-throughput sequence capture and sequencing technologies

The high-throughput sequence capture methods are able to isolate the collection of exons in a more efficient and cost-effective way than traditional PCR-based methods. Without these sequence capture methods, the approximately 180,000 exons in the human genome would require designing an equivalent or larger number of PCR primer sets to isolate and amplify (Ng et al. 2009), and it would therefore be costlier and time consuming to study the exome using PCR-based isolation methods. These high-throughput sequence capture methods are commercially marketed, for example the NimbleGen Sequence Capture technology (http://www.nimblegen.com/) and Agilent SureSelect Target Enrichment technology (http://www.home.agilent.com). These sequence capture methods allow researchers to target custom genomic regions of interest in the human genome for up to tens of megabases and also enable enrichment of the exome in a single experiment. This development coupled with the high-throughput sequencing data produced by next-generation sequencing (NGS) technologies ensures an adequate depth of sequencing coverage to accurately detect the variants in the exome or targeted regions (Mamanova et al. 2010; Turner et al. 2010; Koboldt et al. 2010; Metzker 2010; Shendure and Ji 2008).

The total size of the human exome is approximately 30 Mb, which comprises approximately 1% of the entire human genome. Therefore, exome sequencing requires many-fold less sequencing data than whole-genome sequencing to achieve the desired depth of sequencing coverage for variant detection. As a result, exome sequencing has emerged as a more popular approach to study Mendelian disorders (Table 1). Although many recent studies are labeled as ‘exome sequencing’, the sequence capture methods employed are unable to completely isolate all the exons experimentally, i.e. a fraction of exons will be missed. Furthermore, the probes in sequence capture methods are designed based on the sequence information from gene annotation databases such as the consensus coding sequence (CCDS) database and RefSeq database; therefore, unknown or yet-to-be-annotated exons cannot be captured. Regions that are poorly mapped with short sequence reads due to paralogous sequences elsewhere in the genome have to be excluded as well (Ng et al. 2009). As such, the exome capture is not complete. The incomplete capture of the exome can create additional problems in identifying the causal variants and candidate genes for Mendelian disorders.
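The data-volume argument above can be made concrete with a rough calculation. The sketch below is illustrative only: the 3,000 Mb genome size and the 30× target depth are assumed round figures that do not come from the text, and capture inefficiency and off-target reads are ignored.

```python
def sequence_needed_mb(target_size_mb, mean_depth):
    """Raw sequence (in Mb) needed to reach a mean depth over a target region."""
    return target_size_mb * mean_depth

EXOME_MB = 30.0     # approximate exome size quoted in the text
GENOME_MB = 3000.0  # assumed round figure for the whole genome
DEPTH = 30.0        # hypothetical target depth, for illustration only

exome_data = sequence_needed_mb(EXOME_MB, DEPTH)
genome_data = sequence_needed_mb(GENOME_MB, DEPTH)
fold_saving = genome_data / exome_data

print(f"exome: {exome_data:.0f} Mb, genome: {genome_data:.0f} Mb, "
      f"fold difference: {fold_saving:.0f}x")
```

Under these assumptions the exome needs about one hundredth of the data, which mirrors the roughly 1% share of the genome. In practice the saving is smaller because not every read falls on target.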

The exome sequencing studies have focused primarily on the approximately ‘30Mb sequences’ encompassing exons and splice sites using commercially available sequence capture methods. As such, these sequence capture methods have limited or no coverage of other important regulatory sequences such as promoters, enhancers, microRNAs and other annotated regulatory elements and evolutionary conserved non-coding sequences. For example, the Agilent SureSelect Human All Exon Kit covers 38 Mb of sequences corresponding to the exons and flanking intronic regions of 23,739 genes in the CCDS database (September 2009 release) and also encompasses 700 microRNAs from the Sanger v13 database and 300 non-coding RNAs (Walsh et al. 2010). Although this ‘all exon kit’ has expanded the coverage beyond the exome, the coverage of regulatory sequences is not complete and also raises a further question of why evolutionary conserved non-coding sequences are not included.

Some researchers may consider the limited coverage of important regulatory and evolutionary conserved sequences as one limitation of exome sequencing; hence, there is now an increasing demand to include these regions in future exome sequencing studies. Undoubtedly, it is advantageous to include as many annotated regulatory and evolutionary conserved sequences as possible where the causal variants might be found, but this will then add to the cost of sequence capture methods and sequencing as more sequences will need to be isolated and sequenced. Recently, the introduction of the Illumina TruSeq Exome Enrichment Kit has doubled the size of targeted regions to 62 Mb with more than 90% coverage of the exons or genes in the latest version of the CCDS and RefSeq database (http://www.illumina.com/products/truseq_exome_enrichment_kit.ilmn). However, in this scenario where there is a continuous demand to increase the coverage or size of targeted regions beyond the exome, whole-genome sequencing is probably a more viable option that has been adopted in some studies (Sobreira et al. 2010; Lupski et al. 2010; Rios et al. 2010). Ultimately, it will be more efficient and cost-effective to subtract the sequence reads in ‘unwanted’ regions by bioinformatic analysis after whole-genome sequencing than including the ‘wanted’ regions during the sequence capturing stages if the coverage of targeted regions continues to expand.

All the NGS technologies have higher base calling error rates than Sanger sequencing, although this can be remedied to some extent by increasing the depth of sequencing coverage to ensure minimal errors (Koboldt et al. 2010). An adequate depth of sequencing coverage is also critical for identifying heterozygotes such as de novo variants or heterozygous variants causing dominant Mendelian disorders or compound heterozygotes causing recessive disorders. Gilissen et al. (2010) used the Agilent SureSelect human exome kit in combination with ABI SOLiD sequencing to generate 3.6 and 3.4 gigabases of mappable sequence data for two patients with Sensenbrenner syndrome and achieved an average sequencing coverage of 67× and 59× for the exomes (Gilissen et al. 2010), while Wang et al. (2010) obtained an average coverage of 65× for four exomes affected with autosomal-dominant spinocerebellar ataxias and reported that approximately 97% of the targeted bases were covered sufficiently to pass their thresholds for variant calling (Wang et al. 2010). Thus, this depth of sequencing coverage was deemed sufficient for accurate detection of variants. This is critical for subsequent downstream analysis because base calling errors could mistakenly be thought of as rare variants. These artifacts will make the searching for causal variants and candidate genes more difficult if not properly accounted for.
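Why depth matters for heterozygous sites can be illustrated with a simple binomial model. This is a minimal sketch under strong assumptions that are not taken from the studies cited above: each read is assumed to sample either allele independently with probability 0.5, base-calling errors and mapping bias are ignored, and the requirement of at least three variant-supporting reads is a hypothetical calling rule.

```python
from math import comb

def prob_at_least_k_variant_reads(depth, min_variant_reads, allele_fraction=0.5):
    """Probability of seeing >= min_variant_reads variant reads at a heterozygous site.

    Assumes reads are independent draws, each carrying the variant allele
    with probability allele_fraction (a simplification).
    """
    p_fewer = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(min_variant_reads)
    )
    return 1.0 - p_fewer

# Hypothetical rule: call a heterozygote only if >= 3 reads support the variant.
p_low = prob_at_least_k_variant_reads(depth=8, min_variant_reads=3)
p_high = prob_at_least_k_variant_reads(depth=60, min_variant_reads=3)
print(f"depth 8:  {p_low:.4f}")
print(f"depth 60: {p_high:.6f}")
```

Under this toy model a site covered by only eight reads would be missed in roughly one case out of seven, whereas at a depth near the averages reported above the chance of missing it is negligible. Real coverage is uneven across targets, so the average depth overstates the coverage of poorly captured exons.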

The barcoding method allows multiplexing of up to tens of samples to be sequenced per instrument run and offers a cost advantage. The level of multiplexing depends on the size of the targeted regions to be sequenced and the depth of sequencing coverage to be achieved. Given the continuous increase in the throughput of sequencing data generated by NGS technologies, where several hundred gigabases of data are generated per instrument run, barcoding of the samples will be more cost-effective and avoid over-sequencing of samples. Over-sequencing of samples would result in diminishing returns in the accuracy of variant detection (Craig et al. 2008; Szelinger et al. 2011).
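The relationship between multiplexing level, target size and depth can be sketched as a simple division. The run yield, on-target fraction and depth used below are hypothetical values chosen for illustration; only the 38 Mb and 62 Mb target sizes are taken from kits mentioned in this article.

```python
import math

def max_samples_per_run(run_yield_gb, target_size_mb, desired_depth,
                        on_target_fraction=1.0):
    """Estimate how many barcoded samples fit in one instrument run.

    on_target_fraction is the assumed share of sequenced bases that map
    to the targeted regions (capture is never perfectly efficient).
    """
    usable_mb = run_yield_gb * 1000.0 * on_target_fraction
    per_sample_mb = target_size_mb * desired_depth
    return math.floor(usable_mb / per_sample_mb)

# Hypothetical run: 200 Gb yield, half of the bases on target, 60x mean depth.
n_38mb = max_samples_per_run(200, 38, 60, on_target_fraction=0.5)
n_62mb = max_samples_per_run(200, 62, 60, on_target_fraction=0.5)
print(n_38mb, n_62mb)
```

With these assumed numbers, enlarging the target from 38 Mb to 62 Mb reduces the number of samples per run from 43 to 26, which is the trade-off between target size and multiplexing described above.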

To conclude, technological developments have made exome sequencing more practical and affordable: from several samples needed for Mendelian disorders to hundreds of samples for complex diseases (Li et al. 2010b). These technological developments have been one of the main driving forces of the exome sequencing era with more than 20 studies being published on the subject in 2010 (Table 1) (Bilgüvar et al. 2010; Roach et al. 2010; Byun et al. 2010; Haack et al. 2010; Bonnefond et al. 2010; Worthey et al. 2010). In addition, these technologies have also accelerated efforts in sequencing the previously identified linkage regions (Brkanac et al. 2009; Nikopoulos et al. 2010; Volpi et al. 2010; Rehman et al. 2010). Brkanac et al. (2009) applied sequence capture and NGS methods to sequence all the genes in a previously identified linkage region, chromosome 7q22-q32, for autosomal-dominant sensory/motor neuropathy with ataxia and identified a nonsynonymous variant in IFRD1 causing the disorder. Without these technologies, interrogating the linkage regions of several centimorgans using PCR and Sanger sequencing methods would be a daunting task. Therefore, targeted sequence capture followed by NGS should be performed to investigate the established linkage regions from previous studies.

The rise of complex disease research

Over the past 5 years, the genetics research community has focused mainly on dissecting the genetic basis of complex (non-Mendelian, polygenic or multifactorial) diseases and traits. Prior to this, studies of complex phenotypes had met with limited success using candidate-gene association and linkage study designs (Hirschhorn et al. 2002; Hirschhorn 2005). Although linkage studies have identified causal variants for thousands of Mendelian disorders, this approach is ineffective and unsuitable for complex diseases caused by complex interactions of multiple genetic and environmental factors. Nevertheless, significant progress has been achieved since 2005 through GWAS (Altshuler et al. 2008; Hindorff et al. 2009; Ku et al. 2010). Presently, more than 4,000 SNPs have been reported to be associated with various human complex diseases and traits (A Catalog of Published Genome-Wide Association Studies, http://www.genome.gov/26525384).

However, due to the indirect study design of GWAS being reliant on linkage disequilibrium, the causal variants remain elusive in most of the GWAS-detected loci. It is also more difficult to identify the causal variants for complex diseases resulting from multiple genetic variants of low penetrance. This is in contrast to Mendelian disorders which are caused by variants with complete (or nearly complete) penetrance showing a strong genotype–phenotype relationship. Therefore, despite the success of GWAS in unraveling thousands of statistically robust SNP associations, the causal variants and candidate genes for most complex diseases have not been convincingly identified (Altshuler et al. 2008; Hindorff et al. 2009; Ku et al. 2010).

In comparison with complex diseases, relatively slower progress was made in identifying causal variants for Mendelian disorders during the peak period of GWAS research (2006–2009) until the first proof-of-principle study demonstrated the feasibility of exome sequencing to identify a known candidate gene for Freeman–Sheldon syndrome (Ng et al. 2009). Several reasons have been cited for Mendelian disorders receiving little attention in recent years. First, the causal variants and candidate genes for most of the Mendelian disorders that can be investigated by linkage studies have already been identified (Amberger et al. 2009). Other disorders are too rare to be investigated by linkage studies. Second, a powerful method to study extremely rare disorders or cases caused by de novo variants was not previously available. Although NGS technologies have been available since 2005, exome sequencing was not technically feasible and efficient until the advent of high-throughput sequence capture methods to isolate the exome (Mamanova et al. 2010; Turner et al. 2010). These problems are related to the Mendelian disorders themselves; however, other factors favored complex disease research, as discussed below.

Third, researchers’ enthusiasm for complex disease research increased after the notable success of the GWAS of age-related macular degeneration (Klein et al. 2005). The completion of the International HapMap Project, the advent of high-resolution genotyping microarrays, the collection of large sample sizes, and the development of powerful statistical analysis methods have led to the rapid increase in publications of GWAS since 2005 (Seng and Seng 2008). Furthermore, delineating the genetics of complex diseases such as metabolic, cardiovascular, autoimmune and chronic inflammatory and infectious diseases is believed to be more important from the public health perspective as these diseases affect a much larger fraction of the population than Mendelian disorders (McCarthy 2010; Musunuru and Kathiresan 2010; Baranzini 2009). Collectively, these factors have gradually attracted more attention towards complex diseases.

Why study Mendelian disorders

Research into the previously unexplained Mendelian disorders (i.e. where causal variants have not been identified) should become a priority now and in the near future for several reasons. First, Mendelian disorders as a collective make up approximately 7,000 known or suspected disorders and contribute significantly to the disease burden in society, even though they have been labeled as rare or extremely rare disorders compared with the more common complex diseases (Ropers 2007; Ropers 2010; Antonarakis and Beckmann 2006; Antonarakis et al. 2010). We further discuss the importance of studying Mendelian disorders from three aspects: (a) revealing genes for complex diseases and traits, (b) providing new biological insights and (c) identifying drug targets.

Studying Mendelian disorders can reveal genes and biological pathways that are associated with the development of complex diseases. This was illustrated in the identification of SNPs in WFS1 and TCF2 associated with the polygenic form of type-2 diabetes (Sandhu et al. 2007; Winckler et al. 2007). The WFS1 gene was prioritized as a candidate to be interrogated in candidate gene association studies because the rare variants in WFS1 cause a monogenic form of diabetes (Wolfram syndrome). Thus, WFS1 becomes a biologically plausible gene for polygenic type-2 diabetes (Sandhu et al. 2007). Similarly, rare variants in the TCF2 gene cause maturity-onset diabetes of the young (MODY) (Winckler et al. 2007).

Similarly, defective MC4R is the leading cause of monogenic severe childhood-onset obesity, and common SNPs near MC4R were also found to be associated with fat mass, weight, risk of obesity and other metabolic-related traits (Loos et al. 2008; Chambers et al. 2008). Numerous GWAS-identified common SNPs which are associated with triglycerides, high-density lipoprotein (HDL) cholesterol and low-density lipoprotein (LDL) cholesterol levels were also found in the candidate genes causing the monogenic forms of these lipid metabolism disorders (Kathiresan et al. 2008; Hegele 2009). The convergence of genes identified for Mendelian and polygenic diseases was also seen in other diseases such as Parkinson’s disease (Gasser 2009; Lesage and Brice 2009). Recently, the identification of TECR for non-syndromic mental retardation through exome sequencing also suggests that this gene should be further studied in patients with neurological and psychiatric diseases such as schizophrenia and autism. This study has implicated a potential candidate gene to be investigated in other diseases which may then provide important information for revealing common molecular pathways underlying the development of these diseases (Caliskan et al. 2011).

The discovery of causal variants and candidate genes responsible for Mendelian disorders will also help in understanding the biological function of these genes. For example, the discovery of causal variants in DHODH (which encodes the enzyme dihydroorotate dehydrogenase) for Miller syndrome has provided new insights into the role of pyrimidine metabolism in craniofacial and limb development (Ng et al. 2010b). The discovery of causal variants in TGM6 for spinocerebellar ataxias also provides further evidence to suggest its involvement in the pathogenesis of neurodegenerative diseases (Wang et al. 2010).

Much of the molecular biological research on amyotrophic lateral sclerosis is based on the discovery of causal variants in genes such as SOD1, TDP-43 and FUS that are responsible for the familial or monogenic form of this disease. Given its value in providing new biological insights, investigating the genetic basis of the monogenic form of complex diseases has received recent attention. A causal variant in a previously unreported gene has been identified for familial amyotrophic lateral sclerosis (Johnson et al. 2010b), and a nonsynonymous variant in VCP was identified through exome sequencing of two affected individuals in a family. This discovery provided new insights into the investigation and understanding of the molecular biology and pathogenesis of amyotrophic lateral sclerosis. The finding of causal variants in VCP for familial amyotrophic lateral sclerosis implicates defects in the ubiquitination/protein degradation pathway in motor neuron degeneration.

The potential discovery of new drug targets through studying the genetics of Mendelian disorders should also be emphasized. The discovery of drugs targeting PPARγ and KCNJ11 as a treatment for type-2 diabetes strongly supports this potential. The drugs used to lower cholesterol levels by inhibiting the enzyme HMG-CoA reductase (i.e. statins) were also discovered through studying familial hypercholesterolaemia (Brinkman et al. 2006). Finally, studying Mendelian disorders also contributes to our understanding of human physiology, e.g. studying the Mendelian forms of hypertension has improved our knowledge of blood pressure and volume regulation (Luft 2003).

Currently, the revisiting of Mendelian disorders is mainly due to the ‘attraction’ of the exome sequencing approach and the ‘distraction’ of the disappointing GWAS results that explain only a small fraction of the heritability of complex diseases and traits (Manolio et al. 2009). Nevertheless, studying complex diseases should not be abandoned, as GWAS have also revealed new biological insights, such as unraveling the autophagy and interleukin-23 receptor pathways for Crohn’s disease (Mathew 2008; Cho 2008). A balance between Mendelian disorders and complex diseases research is needed, as research in one cannot be substituted by the other. The knowledge gained from studying Mendelian disorders and complex diseases will eventually complement each other and synergistically enhance our understanding of genotype-phenotype relationships.

A balance for Mendelian disorders and complex diseases

Over the past few years, enormous resources have been invested in research on complex diseases and traits where hundreds of GWAS projects were funded and many huge consortia established to tackle the genetics of these phenotypes (Voight et al. 2010; Teslovich et al. 2010). Several international projects such as the International HapMap Project and 1000 Genomes Project were initiated with the aim of providing useful resources for elucidating the genetics of complex diseases (International HapMap 3 Consortium 2010; 1000 Genomes Project Consortium 2010). Biobanks were also established to properly collect and store hundreds of thousands of biological samples for future investigation of complex diseases (Palmer 2007; Nakamura 2007; Fan et al. 2008). Fortunately, the desire and endeavor to study the genetics of complex diseases has also driven unprecedented developments in microarray (genotyping and sequence capture) and sequencing technologies. These developments have eventually enabled exome sequencing or whole-genome sequencing to be applied to Mendelian disorders.

Despite Mendelian disorder research being considered successful, more than half of the approximately 7,000 known or suspected Mendelian disorders identified based on clinical features have not yet been linked to their candidate genes harboring causal variants. New efforts to improve this include a recent initiative by the National Human Genome Research Institute (USA) to establish ‘A Center for Mendelian Disorders’ whose mission will be to take on the sequencing of Mendelian disorders. This center would be expected to solve the molecular basis of 40–50 disorders per year. In addition, this center will also coordinate the collection and distribution of samples for all remaining unexplained Mendelian disorders, for example by identifying samples within the community and obtaining commitments from the investigators who have samples for distribution to other groups who are able to do exome sequencing. This will facilitate and accelerate the effort to identify causal variants for as many of these disorders as possible (NHGRI Large‐Scale Sequencing Program May 2010, http://www.genome.gov/).

Exome sequencing of Mendelian disorders

Sequencing of unrelated individuals

The advent of the exome sequencing approach has immediately overcome the major obstacles in studying extremely rare Mendelian disorders and de novo variants. This proof-of-concept was demonstrated by Ng et al. (2009) in Freeman–Sheldon syndrome. Only four unrelated cases were subjected to exome sequencing and MYH3 was identified as the single candidate gene harboring at least one nonsynonymous variant, splice-site disruption or coding indel in all cases. The causal variants identified in MYH3 were previously unidentified, i.e. neither cataloged in dbSNP nor present in exome sequencing data of eight HapMap samples. Although MYH3 is a known candidate gene for Freeman–Sheldon syndrome, this study showed the feasibility of applying exome sequencing to identify the candidate gene for a Mendelian disorder despite a small number of unrelated cases (Ng et al. 2009).

As tens of thousands of single nucleotide variants and short indels have been detected in the sequencing of the human exome, multiple robust filtering criteria needed to be applied to discern the causal variants. These filters included first identifying the genes with one or more nonsynonymous variants, splice-site disruptions or coding indels in the exomes of the four individuals with Freeman–Sheldon syndrome (which assumed no genetic heterogeneity among the cases) investigated by Ng et al. (2009) and excluding those common variants as they were less likely to be causative (Table 1) (Ng et al. 2009). These filters have proven effective in identifying the known candidate gene for Freeman–Sheldon syndrome.
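The filtering logic described above can be sketched in a few lines. This is a simplified illustration, not the pipeline used by Ng et al. (2009): the data layout, the effect labels, and the gene and variant names are all hypothetical, and the set of known variants stands in for resources such as dbSNP and control exomes.

```python
# Functional classes retained by the filter described in the text.
FUNCTIONAL_CLASSES = {"nonsynonymous", "splice_site", "coding_indel"}

def candidate_genes(cases, known_variant_ids):
    """Return genes carrying at least one novel functional variant in every case.

    cases: dict mapping case ID -> list of (gene, variant_id, effect) tuples.
    known_variant_ids: IDs already seen in public databases or control exomes.
    """
    genes_per_case = []
    for variants in cases.values():
        genes = {
            gene
            for gene, variant_id, effect in variants
            if effect in FUNCTIONAL_CLASSES and variant_id not in known_variant_ids
        }
        genes_per_case.append(genes)
    if not genes_per_case:
        return set()
    # Requiring the gene in all cases assumes no genetic heterogeneity.
    return set.intersection(*genes_per_case)

# Toy data: gene names and variant IDs are made up.
toy_cases = {
    "case1": [("GENE_A", "v1", "nonsynonymous"),
              ("GENE_B", "v2", "synonymous"),
              ("GENE_C", "v3", "nonsynonymous")],
    "case2": [("GENE_A", "v4", "splice_site"),
              ("GENE_C", "v3", "nonsynonymous")],
    "case3": [("GENE_A", "v5", "coding_indel"),
              ("GENE_B", "v6", "nonsynonymous")],
}
toy_known = {"v3"}  # pretend v3 is a common, already cataloged variant

shared = candidate_genes(toy_cases, toy_known)
print(shared)
```

In this toy example only GENE_A survives, because GENE_C is supported solely by a known variant and GENE_B lacks a functional variant in two of the cases. The strict intersection is what makes the approach fail when cases do not all share one causal gene.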

Nonetheless, for other Mendelian disorders, the identification of the causal variant or candidate gene is not as straightforward as demonstrated in Freeman–Sheldon syndrome. The exomes of ten unrelated individuals affected with Kabuki syndrome were also sequenced by the same group of researchers (Ng et al. 2010a). However, after applying the same filtering strategies, the study failed to identify a compelling candidate gene whose previously unidentified variants were seen in all the individuals. This result suggests the presence of genetic and phenotypic heterogeneity underlying the disorder. To account for genetic heterogeneity, a less stringent strategy was applied by looking for candidate genes shared among subsets of affected individuals. Additionally, various ranking and stratifying steps were taken to account for phenotypic heterogeneity. These additional strategies finally led to the identification of causal variants in the MLL2 gene (Table 1) (Ng et al. 2010a).
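The less stringent, subset-based strategy can be illustrated by counting how many cases carry a qualifying variant in each gene instead of demanding the gene in every case. The sketch below assumes the per-case gene sets have already been filtered for novel functional variants; the gene names, the input layout and the `min_cases` threshold are hypothetical and are not taken from Ng et al. (2010a), and the phenotypic ranking steps are not modeled.

```python
from collections import Counter

def rank_genes_by_sharing(genes_per_case, min_cases):
    """Rank genes by the number of cases carrying a qualifying variant.

    genes_per_case: dict mapping case ID -> set of genes with at least one
    novel functional variant in that case.
    min_cases: smallest subset size for a gene to be reported.
    """
    counts = Counter(
        gene for genes in genes_per_case.values() for gene in set(genes)
    )
    # Sort by descending case count, then by gene name for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(gene, n) for gene, n in ranked if n >= min_cases]

# Toy data: no single gene is shared by all four cases.
toy_gene_sets = {
    "case1": {"GENE_A", "GENE_B"},
    "case2": {"GENE_A"},
    "case3": {"GENE_B", "GENE_C"},
    "case4": {"GENE_A", "GENE_C"},
}

ranked = rank_genes_by_sharing(toy_gene_sets, min_cases=2)
print(ranked)
```

Relaxing the threshold brings back candidates that a strict intersection would discard, at the cost of a longer list that then needs further prioritization.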

Similarly, in Sensenbrenner syndrome, only two out of eight individuals had causal variants in WDR35. These causal variants were identified only in two unrelated cases with a strikingly similar phenotype. No causal variant in the gene was identified in the other six patients presenting with additional clinical phenotypes, and these patients did not show the striking phenotypic similarity observed between the first two patients in the discovery study (Gilissen et al. 2010). This highlights the complexity of genetic and phenotypic heterogeneity and implies that classifying the phenotypic heterogeneity (by focusing on a very similar phenotype) helps in identifying the causal variants.

Further studies have also identified a number of novel candidate genes harboring causal variants for disorders such as Miller syndrome (Ng et al. 2010b), Fowler syndrome (Lalonde et al. 2010), Perrault syndrome (Pierce et al. 2010) and Schinzel-Giedion syndrome (Hoischen et al. 2010) (Table 1). Of particular interest is the candidate gene identified for Fowler syndrome. A compound heterozygote of two variants in FLVCR2 was identified for each of the two cases, and thus a total of four different variants were identified in this gene (Lalonde et al. 2010). Compound heterozygotes in HSD17B4 and DHODH were also found for a number of individuals with Perrault syndrome (Pierce et al. 2010) and Miller syndrome (Ng et al. 2010b), respectively. A compound heterozygote carries two different heterozygous variants at two distinct positions of the same gene on the homologous chromosomes (i.e. one variant on the maternal chromosome and the other on the paternal chromosome). Because both copies of the gene are then defective, these deleterious variants can result in a recessive disorder even though each is in the heterozygote state (Fig. 1). The exome sequencing studies have applied various strategies to identify the causal variants for different disorders, and some studies have integrated exome sequencing data with linkage and homozygosity analysis (Table 1).

Fig. 1

This schematic diagram illustrates the concepts of ‘compound heterozygote’ and ‘de novo variant’. The red and yellow stars represent two deleterious variants in two different positions in a gene. The red variant is passed on to the child from the father and the yellow variant from the mother. Each deleterious variant is in a heterozygote state. In the child, both copies of the gene are defective due to the presence of red and yellow variants. Therefore, even in a heterozygous state, these deleterious variants can result in a recessive disorder. This is different from a homozygote variant (gray star) causing a recessive disorder. Therefore, recessive disorders can be caused by (1) two different heterozygous variants (compound heterozygote) and (2) two similar variants (homozygote). The green star represents a de novo variant which is absent in paternal and maternal chromosomes
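The categories illustrated in Fig. 1 can be expressed as a simple classification over genotypes from a child and both parents. The sketch below is illustrative only: the dictionary keys, the genotype coding (number of copies of the variant allele: 0, 1 or 2) and the gene and variant names are assumptions made for this example, and it presumes the variants have already been filtered to deleterious ones.

```python
from collections import defaultdict


def classify_child_variants(variants):
    """Sort a child's variants into de novo, homozygous and
    compound heterozygous groups using parental genotypes."""
    de_novo, homozygous = [], []
    paternal, maternal = defaultdict(list), defaultdict(list)
    for v in variants:
        if v["child"] == 0:
            continue  # the child does not carry this variant
        if v["father"] == 0 and v["mother"] == 0:
            de_novo.append(v["id"])  # absent from both parents
        elif v["child"] == 2:
            homozygous.append(v["id"])  # same variant on both copies
        elif v["mother"] == 0:
            paternal[v["gene"]].append(v["id"])  # inherited from the father
        elif v["father"] == 0:
            maternal[v["gene"]].append(v["id"])  # inherited from the mother
        # Heterozygous variants carried by both parents cannot be assigned
        # to one chromosome from genotypes alone, so they are left out here.
    # A gene is a compound heterozygous candidate when it has at least one
    # paternally and one maternally inherited heterozygous variant.
    compound = {
        gene: (paternal[gene], maternal[gene])
        for gene in paternal
        if gene in maternal
    }
    return {
        "de_novo": de_novo,
        "homozygous": homozygous,
        "compound_heterozygous": compound,
    }


# Hypothetical trio mirroring the colored stars in Fig. 1.
example_trio = [
    {"id": "red", "gene": "GENE_X", "child": 1, "father": 1, "mother": 0},
    {"id": "yellow", "gene": "GENE_X", "child": 1, "father": 0, "mother": 1},
    {"id": "gray", "gene": "GENE_Y", "child": 2, "father": 1, "mother": 1},
    {"id": "green", "gene": "GENE_Z", "child": 1, "father": 0, "mother": 0},
]
```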

Sequencing of family members

In addition to the previously unexplained Mendelian disorders, exome sequencing has also identified novel causal variants and candidate genes for disorders which have been studied previously, for example autosomal-dominant spinocerebellar ataxias. To date, causal variants in 19 genes have been identified for this disorder. Recently, a causal variant in an additional gene (TGM6) was revealed through exome sequencing (Wang et al. 2010). However, instead of sequencing unrelated individuals from different families as demonstrated in other studies (Table 1), the investigators performed exome sequencing in four affected individuals in one four-generation Chinese family with autosomal-dominant spinocerebellar ataxias. Although this study applied variant filtering strategies similar to those used in other studies of unrelated cases, comparison of the exome data among the four cases to find the shared variant was sufficient to identify TGM6 as the sole candidate gene, containing a new nonsynonymous variant in exon 10.
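Finding the variants shared by all affected relatives amounts to a set intersection. The following minimal sketch assumes each affected individual's filtered variants are available as a collection of identifiers; the individual labels and variant identifiers are hypothetical and do not come from Wang et al. (2010).

```python
def shared_variants(variants_by_individual):
    """Return the variants present in every affected individual."""
    variant_sets = [set(v) for v in variants_by_individual.values()]
    if not variant_sets:
        return set()
    return set.intersection(*variant_sets)


# Hypothetical filtered variants for four affected relatives.
example_family = {
    "affected1": {"var_a", "var_b", "var_c"},
    "affected2": {"var_a", "var_c"},
    "affected3": {"var_a", "var_d"},
    "affected4": {"var_a", "var_b"},
}
```

In this toy family only one variant survives the intersection, which illustrates why sequencing several affected relatives can narrow the list so sharply under a dominant model.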

This study highlighted the advantage of sequencing multiple affected individuals from one family, because it allowed the investigators to hypothesize that all affected individuals should share the same causal variant, as spinocerebellar ataxia was inherited in an autosomal-dominant pattern in this family. The finding from exome sequencing was also supported by linkage analysis, as the causal variant lay within a region revealed by linkage analysis. Spinocerebellar ataxias are also characterized by clinical and genetic heterogeneity, which would benefit from exome sequencing. For such clinically and genetically heterogeneous disorders, sequencing affected individuals in one family offers a further advantage to the study design (Wang et al. 2010), as unrelated cases from different families are likely to have causal variants in different genes. Other studies have also performed exome sequencing in multiple siblings and identified causal variants and candidate genes for disorders such as autosomal-dominant amyotrophic lateral sclerosis (Johnson et al. 2010b), familial combined hypolipidemia (Musunuru et al. 2010) and hyperphosphatasia mental retardation syndrome (Krawitz et al. 2010).

Integration with homozygosity mapping

Exome sequencing has also been swiftly integrated with homozygosity mapping to accelerate the investigation of recessive disorders in consanguineous families (Walsh et al. 2010; Anastasio et al. 2010; Sirmaci et al. 2010; Bolze et al. 2010). Bolze et al. (2010) have demonstrated the advantages of integrating both approaches in identifying causal variants for a clinical syndrome that had never been described previously (Table 1). Homozygosity mapping was performed in three patients and their parents, which identified two homozygosity regions, one on chromosome 11 and one on chromosome 18. In parallel, the exome of one patient was sequenced, identifying 23,146 variants; however, only 67 and 14 of these variants were found in the homozygosity regions on chromosomes 11 and 18, respectively. The availability of homozygosity data thus allowed the investigators to narrow the search space from the exome data to fewer than 100 variants. The subsequent comparisons with SNP databases identified only one previously unreported nonsynonymous variant, which lay in the homozygosity region on chromosome 11 and was located in exon 2 of FADD. The filtering and identification of causal variants were greatly facilitated by integration with homozygosity mapping data (Bolze et al. 2010).
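Restricting exome variants to mapped homozygosity regions is an interval filter. The sketch below is a simplified illustration, not the pipeline of Bolze et al. (2010): the region coordinates, variant positions and identifiers are invented for the example, and only the chromosome names follow the text.

```python
def in_any_region(chrom, position, regions):
    """True if (chrom, position) falls inside any (chrom, start, end) region."""
    return any(
        region_chrom == chrom and start <= position <= end
        for region_chrom, start, end in regions
    )


def restrict_to_regions(variants, regions):
    """Keep only variants located inside the given regions.
    Each variant is a (chrom, position, identifier) tuple."""
    return [v for v in variants if in_any_region(v[0], v[1], regions)]


# Hypothetical homozygosity regions (coordinates are placeholders).
example_regions = [("chr11", 1_000_000, 5_000_000), ("chr18", 20_000_000, 22_000_000)]

# Hypothetical exome variants.
example_variants = [
    ("chr11", 2_500_000, "var_1"),
    ("chr11", 9_000_000, "var_2"),
    ("chr18", 21_000_000, "var_3"),
    ("chr2", 2_500_000, "var_4"),
]
```

Applied to real data, a filter of this kind is what reduces a list of tens of thousands of exome variants to the few dozen that fall within the mapped regions, before database comparison is applied.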

Diagnostic application

Exome sequencing is also a useful tool for diagnostic application. The genetic diagnosis of congenital chloride diarrhea in a patient was made through exome sequencing, which revealed a homozygous missense variant in SLC26A3. The position of this variant is completely conserved from invertebrates to humans (Choi et al. 2009). However, other studies have adopted whole-genome sequencing as a diagnostic application. For example, it was applied to an 11-month-old patient with severe hypercholesterolemia and identified approximately 3.8 million variants, of which only 9,726 were nonsynonymous, 699 of them new. The defective gene ABCG5 was identified because it had two nonsense variants (Rios et al. 2010). The diagnostic application was further illustrated by Lupski et al. (2010) through whole-genome sequencing of a proband with Charcot–Marie–Tooth disease. However, this study only focused on those genes known to cause the neuropathic condition. One missense variant and one nonsense variant were detected in SH3TC2, and all affected individuals in the family of the proband were found to be compound heterozygotes for these variants (Lupski et al. 2010). Although whole-genome sequencing was done in some studies (Rios et al. 2010; Lupski et al. 2010), exome sequencing would have been sufficient to identify the causal variants and genes for severe hypercholesterolemia and Charcot–Marie–Tooth disease. The cost of a diagnostic test would be an important factor to consider for clinical utility. Exome sequencing is anticipated to be used increasingly in molecular diagnosis (Bonnefond et al. 2010; Worthey et al. 2010; Montenegro et al. 2011).

Pros and cons of variant filtering strategies

There are two important assumptions underlying the variant filtering strategies of these exome sequencing studies: (a) causal variants for Mendelian disorders would be rare and therefore likely to be previously unidentified in public databases or control sequencing data and (b) synonymous variants would be far less likely to be causative. However, several caveats must be noted.

The filtering of common variants in the exome by comparison with public databases such as the dbSNP, the HapMap Project, the 1000 Genomes Project and other exome sequencing data is of benefit. This has proven effective in removing a substantial number of less likely causal variants (Table 1) as the causal variants for extremely rare Mendelian disorders should be ‘very rare’. In addition, de novo variants are also rare, occurring in a heterozygote state for dominant disorders. This simple assumption and filtering strategy offers an advantage to quickly sift through the exome data for promising causal variants. However, the removal of common variants by comparison to the dbSNP has a weakness due to the considerable fraction of false-positive errors in the dbSNP. Currently, more than 17 million SNPs in the human genome have been documented in the dbSNP with a false-positive rate of 15–17% estimated for the database (Day 2010). Therefore, some important variants in the exome may be discarded. This problem is likely to be overcome by a more accurate database upon the completion of the 1000 Genomes Project. However, with the continuous cataloging of rarer variants in the human genome, an ‘optimal’ cutoff of frequency needs to be imposed to distinguish between what constitutes the ‘common variants’ that are less likely to be causative compared to ‘rare variants’ that need to be retained for analysis.

The exome sequencing studies have focused on nonsynonymous and nonsense variants, splice-site variants and frameshift indels (collectively known as deleterious variants) and ignored synonymous variants, which are far less likely to be deleterious (Table 1). By discarding synonymous variants, the number of variants is substantially reduced for downstream analysis. However, in the event that some cases are unexplained by deleterious variants, it is not immediately clear whether synonymous variants are causative for the unexplained cases. Moreover, it is currently unclear how best to incorporate the synonymous variants into an analysis with deleterious variants robustly and efficiently to identify candidate genes for Mendelian disorders.
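The two assumptions discussed in this section translate into a two-step filter: keep only deleterious functional classes, then drop variants already seen in reference data. The sketch below is a simplified illustration under stated assumptions: the effect labels, the `known_frequency` lookup (standing in for dbSNP, HapMap or 1000 Genomes data) and the optional `max_frequency` cutoff are hypothetical names introduced here, not part of any cited study's software.

```python
# Functional classes treated as deleterious in the studies reviewed.
DELETERIOUS_EFFECTS = {"nonsynonymous", "nonsense", "splice_site", "frameshift_indel"}


def filter_candidates(variants, known_frequency, max_frequency=None):
    """Keep deleterious variants that are novel or sufficiently rare.

    known_frequency maps a variant identifier to its allele frequency in
    reference data; identifiers missing from the map count as novel.
    With max_frequency=None, every previously seen variant is removed
    (the strict novelty filter); otherwise variants at or below the
    cutoff are retained.
    """
    kept = []
    for variant in variants:
        if variant["effect"] not in DELETERIOUS_EFFECTS:
            continue  # e.g. synonymous variants are discarded
        frequency = known_frequency.get(variant["id"])
        if frequency is not None:
            if max_frequency is None or frequency > max_frequency:
                continue  # previously reported and too common
        kept.append(variant)
    return kept


# Hypothetical variants and reference frequencies.
example_variants = [
    {"id": "v1", "effect": "nonsynonymous"},
    {"id": "v2", "effect": "synonymous"},
    {"id": "v3", "effect": "nonsense"},
    {"id": "v4", "effect": "splice_site"},
]
example_frequencies = {"v3": 0.20, "v4": 0.001}
```

The `max_frequency` parameter makes the open question of an 'optimal' frequency cutoff explicit: the strict setting discards a rare but previously catalogued variant such as `v4`, while a permissive cutoff retains it at the price of a longer candidate list.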

Exome sequencing versus whole-genome sequencing

In this section we discuss the advantages and pitfalls of exome sequencing in comparison to whole-genome sequencing. Since the exome constitutes only approximately 1% of the human genome, it requires less sequencing data than whole-genome sequencing to achieve the depth of coverage needed to detect variants accurately. For example, 138 gigabases of mappable sequence data were generated to achieve an average coverage of 49× in the whole-genome sequencing of a patient with severe hypercholesterolemia (Rios et al. 2010). In contrast, <4 gigabases of mappable sequence data were generated to achieve average coverages of 59× and 67× in the exome sequencing of two patients with Sensenbrenner syndrome (Gilissen et al. 2010).
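The link between data volume and coverage can be made explicit with a rough calculation: average depth is approximately the number of mapped bases that fall on the target divided by the target size. The sketch below assumes uniform coverage; the `on_target_fraction` parameter is an assumption introduced here to represent the share of capture reads that land on exons, and is not a figure reported by the cited studies.

```python
def mean_depth(mappable_bases, target_size_bases, on_target_fraction=1.0):
    """Approximate average depth of coverage over a target region."""
    return mappable_bases * on_target_fraction / target_size_bases


def implied_target_size(mappable_bases, depth):
    """Size of the region that the given data would cover at `depth`."""
    return mappable_bases / depth


# Whole-genome figures quoted above: 138 gigabases at 49x average coverage.
genome_covered = implied_target_size(138e9, 49)
```

With the whole-genome figures quoted above, the implied covered sequence is about 2.8 gigabases, close to the size of the human genome, whereas a target of roughly 1% of that size needs only a few gigabases of data to reach a comparable or higher depth.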

Furthermore, most of the known causal variants for Mendelian disorders were found in exons. The exome has been the focus of studies for Mendelian disorders because nonsynonymous variants leading to amino acid changes affect the function of the protein, and nonsense variants producing truncated protein are sufficiently deleterious to cause Mendelian disorders. In addition, small indels in exons can adversely affect the amino acid sequence by shifting the reading frame of the codons. Variants in splice sites can affect mRNA stability and alternative splicing. Although whole-genome sequencing was performed in some studies (Rios et al. 2010; Lupski et al. 2010), these analyses still focused on the variants in exons.

However, whole-genome sequencing offers an advantage in studying other genetic variants besides deleterious variants in exons. The paired-end sequence reads generated by whole-genome sequencing are useful for the detection of various structural variants or chromosomal rearrangements in the genome, which collectively form the second source of genetic abnormalities responsible for Mendelian disorders (Lupski and Stankiewicz 2005; Chen et al. 2010a). Several structural variant detection methods, such as paired-end mapping and depth-of-coverage, have been developed by leveraging the high-density short sequence read data generated by NGS technologies (Korbel et al. 2007; Yoon et al. 2009; Medvedev et al. 2009). Preparation of several DNA fragment libraries with different sizes, coupled with these sequencing-based detection methods, has been demonstrated to be powerful enough to detect structural variants of varying sizes. The application of the mate-pair sequencing method to identify copy number variants was demonstrated in the whole-genome sequencing study of Charcot–Marie–Tooth disease. In parallel, the study also used a comparative genomic hybridization (CGH)-based array and identified a total of 234 copy number variants. However, none of the copy number variants affected genes known to be involved in Charcot–Marie–Tooth disease (Lupski et al. 2010). Although high-resolution oligonucleotide CGH or SNP microarrays can be used to supplement exome sequencing, these microarray-based methods are only able to detect copy number changes, whereas inversions, translocations and other more complex chromosomal rearrangements are beyond their detection (Carter 2007). The use of these microarrays will also add to the cost of the exome sequencing study.

To ensure a more thorough interrogation of both deleterious single nucleotide variants in the exome and structural rearrangements, the cost of an ‘exome sequencing study’ will comprise spending on sequence capture methods, exome sequencing and CGH or SNP microarrays. This also means that three laboratory experiments are needed and two sets of data (sequencing and microarray data) will be generated. Opting for exome sequencing is mainly driven by the cost advantage. However, given the decreasing cost of whole-genome sequencing, the price gap between the two approaches is becoming smaller. Although whole-genome sequencing is more costly, it has greater value in that data for the whole genome are obtained, compared with 1% of the genome from exome sequencing. Furthermore, only a few samples are usually studied in exome sequencing for Mendelian disorders, unlike complex diseases that require hundreds to thousands of samples. Thus, the difference in cost between the two sequencing approaches will only be multiplied by several samples.

Exome sequencing studies have, without a doubt, identified causal variants and candidate genes for a number of Mendelian disorders (Table 1); nevertheless, a subset of cases for some of the disorders remains unexplained. There are several reasons for this. The capture of the entire collection of exons in the human genome using the available sequence capture methods is by no means complete; thus, variants in the missing exons cannot be studied. Furthermore, non-coding regions (introns and intergenic regions) are not considered in exome sequencing. It is still unclear whether synonymous variants or variants in non-coding regions or deleterious variants in other genes are responsible for the unexplained cases (Cooper et al. 2010; Chen et al. 2010b). In contrast, whole-genome sequencing studies do not have the problem of ‘missing exons’ as a result of incomplete capture. Furthermore, the variants in highly evolutionarily conserved non-coding regions can be readily explored for unexplained cases (Dermitzakis et al. 2005). Many causal variants identified for Mendelian disorders were located in protein sequences which are highly conserved throughout evolution (Table 1). This could also hint at the importance of investigating variants in evolutionarily conserved non-coding regions where important functional elements were found (Dermitzakis et al. 2005; Alexander et al. 2010).

It is anticipated that the >3 million single nucleotide variants detected in whole-genome sequencing will create additional challenges to identifying causal variants and thus more robust filtering strategies are needed. However, most of the common and less likely causal variants should be removed efficiently with the data from the full completion of the 1000 Genomes Project. Although the ‘whole-genome data’ are generated, investigators can still focus on and prioritize the variants in the exome for first-tier analysis. The remaining variants can be used in subsequent tiers of analysis. This strategy was also applied in whole-genome cancer sequencing (Ley et al. 2008). If it appears that variants in evolutionarily conserved non-coding regions or regulatory sequences are also causative or act as modifiers affecting the severity of disorders, then the exome-sequenced samples may need to be resequenced at the whole-genome level. Identifying the variants acting as modifiers will help to better understand phenotypic heterogeneity, but this will be challenging (Génin et al. 2008).

Currently, the sequencing data for Mendelian disorders is still rudimentary; it is difficult to be convinced that the variants in the remaining 99% of the genome are not ‘important’ to these disorders either as causative variants or modifiers (Cooper et al. 2010; Chen et al. 2010b; Dermitzakis et al. 2005). It was previously believed that 99% of the genome consisted of ‘junk DNA’ because these regions did not encode proteins. The functional importance of the ‘junk DNA’ was eventually discovered (Castillo-Davis 2005; Alexander et al. 2010). We hope that more knowledge and understanding will be gained through exploration of variants in the whole genome.

Summary and future direction

In summary, exome sequencing has now been applied in multiple situations, in which (a) several affected siblings in a family, (b) several unrelated cases or (c) sporadic cases are available for analysis, and causal variants for a number of Mendelian disorders have been successfully identified (Table 1). In addition, exome sequencing has been shown to be robust for studying disorders with genetic and phenotypic heterogeneity. It has also proved viable for studying Mendelian disorders when only a single case is available, and de novo causal variants have been successfully identified for sporadic cases. Exome sequencing has also been demonstrated to be a powerful tool in diagnostic application. Integration with linkage and homozygosity data has greatly facilitated the discovery of causal variants and candidate genes for Mendelian disorders.

The number of causal variants and candidate genes identified for Mendelian disorders is anticipated to grow rapidly through individual researchers and large-scale collaborative efforts. The cost of exome sequencing studies is now more affordable and only a few exomes need to be sequenced. Although exome sequencing studies have provided compelling evidence that the identified variants are causative for Mendelian disorders, mutagenesis and animal model studies will still be needed to lend further support to the causality and to demonstrate the effect that causal variants have on the phenotypic level.

Exome sequencing with sufficient depth of coverage has generated high-quality data for single nucleotide variant detection. However, it is difficult to detect indels with the short sequence reads generated by NGS technologies. For example, frameshift indels in two individuals with Kabuki syndrome were undetected by exome sequencing, but were successfully identified using Sanger sequencing (Ng et al. 2010a). Furthermore, exome sequencing is unable to detect structural variants or chromosomal rearrangements, which are believed to be important for Mendelian disorders as well. These limitations, together with the problem of incomplete exome capture and the potential reward from interrogating non-coding regions, especially the highly evolutionarily conserved regions, have sparked a debate on whether whole-genome sequencing is needed. However, this will likely be a non-issue in the next few years when whole-genome sequencing becomes cheaper. Similar to other fields such as cancer genome sequencing (Ley et al. 2008; Pleasance et al. 2010; Lee et al. 2010) and studies of human genetic variants (Bentley et al. 2008; Wheeler et al. 2008; Wang et al. 2008), research on Mendelian disorders will also benefit tremendously from de novo genome assembly when it becomes feasible with better assembly algorithms and longer sequence reads generated by third-generation sequencing technologies (Li et al. 2010a; Schadt et al. 2010).