Introduction

Intellectual disability (ID) (intelligence quotient < 70, with concordant deficits in adaptive function) comprises a highly heterogeneous group of disorders with an overall prevalence of ~1–3% (Chelly et al. 2006; Roeleveld et al. 1997). Genetic abnormalities are believed to be a major cause of ID. Up to 30% of cases were reported to be associated with chromosomal abnormalities (Knight et al. 1999) and hundreds of genes have been implicated (Kramer and van Bokhoven 2009). However, the etiology of ID still remains unknown in about half of all cases.

With the introduction of array-based comparative genomic hybridization (array CGH), which allows high-resolution whole genome analysis, an additional 5–15% of ID patients are found to carry pathogenic submicroscopic copy number variants (CNVs) (for review, see Koolen et al. 2009). With the increasing resolution and better understanding of CNV significance, array CGH is becoming an accepted tool in clinical genetics laboratories for finding the cause of ID. Compared to conventional cytogenetic and linkage analyses, array CGH greatly narrows down the potential pathogenic loci and is a promising tool for uncovering and cataloging ID candidate genes (Vissers et al. 2005). This is exemplified by the discovery of the CHD7 gene in CHARGE syndrome (Vissers et al. 2004), mutations of which were identified after a submicroscopic deletion was noted by array CGH. However, most CNVs encompass multiple (up to 100) genes, which makes it difficult to identify the key disease gene(s) for ID.

The availability of high-throughput genome-wide datasets has facilitated and encouraged the rapid development of bioinformatics methods to prioritize positional candidate genes (Kanehisa and Bork 2003; Stein 2003). These methods are based on single or multiple data sources (see examples in Table 1). To minimize the bias from different single prioritization web tools, which also rely on different statistical methods, a combination of multiple web tools tends to be used for disease gene prediction (Elbers et al. 2007; Huang et al. 2008; Liu et al. 2008; Teber et al. 2009; Thornblad et al. 2007; Tiffin et al. 2006, 2008). Similarly, single gene prioritization web tools, which combine a large number of datasets, have been established. For example, the Endeavour prioritization web tool combines 26 datasets including gene ontology, literature, protein–protein interactions, sequence, gene expression datasets, etc. (Aerts et al. 2006). Prioritization studies have thus far been used mainly for predicting and prioritizing the most promising candidate genes from genomic regions identified by linkage analyses and association studies in complex traits/diseases including type 2 diabetes and obesity (Elbers et al. 2007; Teber et al. 2009; Tiffin et al. 2006), osteoporosis (Huang et al. 2008; Liu et al. 2008) and metabolic syndromes (Tiffin et al. 2008). The application of web tools for gene prioritization in chromosomal regions identified by array CGH has only begun to be explored, with Endeavour being used as a single web tool (Osoegawa et al. 2008; Qiao et al. 2009; Yonan et al. 2003). More recently, the Ingenuity pathway analysis (IPA) web tool was used to help assess the candidate genes from the 1p34 array-detected microdeletion in a subject with autism (Kumar et al. 2009). A systematic analysis of the potential advantages and limitations of using different prioritization web tools for candidate gene identification in subjects with ID therefore remains largely unexplored.

Table 1 Summary of Web site tools for candidate gene prioritization

In this study we used a selection of web tools to prioritize the genes within unique de novo, familial and common CNVs detected using array CGH in 255 subjects with ID. Assessment of the function of the prioritized genes was performed using multiple databases including the mouse knockout phenotype database and other gene function analysis web tools. Such approaches are expected to help in the evaluation and understanding of the contribution of positional candidate genes involved in CNVs and their relationship to ID.

Materials and methods

Subjects

A total of 255 subjects with idiopathic ID were recruited for array CGH analysis by clinical geneticists across Canada. The criteria for selecting the cases included: (i) normal karyotypes by routine cytogenetic testing at the 500–550 band level resolution; (ii) negative fragile X testing by DNA analysis; (iii) a phenotype score ≥3 on a testing prioritization checklist adapted from de Vries et al. (2001); and (iv) both parents available for testing.

Twenty-nine of the 255 cases have been reported previously as individual or small-group cases (Gibson et al. 2008; Harvard et al. 2005; Rajcan-Separovic et al. 2007; Tyson et al. 2004, 2005). Three array platforms were used to screen the subjects: a 1 Mb BAC array on 141 subjects, the Agilent 105K oligo array on 96 and a NimbleGen array on 18.

Array CGH

Genomic DNA was extracted from peripheral blood using PUREGENE DNA Isolation Kit (Gentra, Minneapolis, MN). A pool of normal male or female control DNAs (Promega, Madison, WI) was used as reference DNA matching the sex of the proband samples.

1 Mb resolution BAC array CGH

BAC array CGH was performed as previously described (Rajcan-Separovic et al. 2007). Briefly, sample and reference DNAs were hybridized to the 1-Mb BAC array (Spectral Genomics, Houston, TX) using dye swap methods. Data analysis was performed using Spectralware 2 software (Spectral Genomics). Identification of clones with a significant gain or loss was based on previously established cutoff values of 1.2 and 0.8, respectively (Tyson et al. 2005).

High-resolution oligonucleotide array CGH

Agilent 105K array analysis was performed according to the protocol provided by the company (version 4.0, June 2006, Agilent Technologies, CA, USA) (Fan et al. 2007). Feature Extraction software (version 8.1.1.1, Agilent Technologies) rendered image analysis using the manufacturer’s recommended settings (CGH_-v4_95) and human genome assembly hg18. The minimum absolute average of log2 ratio was 0.25.

Higher-resolution 385K oligonucleotide genome array CGH was performed courtesy of NimbleGen. A log2 ratio beyond ±0.2 was used to define a segment (region). For both the Agilent and NimbleGen array platforms, ≥3 consecutive probes were required for a significant CNV call. CNVs that overlapped in genomic coverage were considered to represent the same CNV locus.
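The calling rule for the oligonucleotide platforms (a log2 ratio beyond a platform-specific threshold in at least three consecutive probes) can be illustrated with a minimal sketch. The function name `call_segments`, the per-probe thresholding and the example ratios are assumptions made for illustration only; the vendor software applies its own segmentation algorithms, and the Agilent criterion was stated as an average log2 ratio rather than a per-probe value.

```python
def call_segments(log2_ratios, threshold=0.25, min_probes=3):
    """Return (first_index, last_index, kind) for every run of at least
    `min_probes` consecutive probes whose log2 ratio lies beyond
    +threshold (gain) or -threshold (loss)."""
    segments = []
    run_start, run_kind = None, None
    # A trailing 0.0 acts as a sentinel so that a run ending at the last
    # probe is still closed and evaluated.
    for i, ratio in enumerate(list(log2_ratios) + [0.0]):
        if ratio > threshold:
            kind = "gain"
        elif ratio < -threshold:
            kind = "loss"
        else:
            kind = None
        if kind != run_kind:
            if run_kind is not None and i - run_start >= min_probes:
                segments.append((run_start, i - 1, run_kind))
            run_start, run_kind = i, kind
    return segments


# Hypothetical probe ratios ordered by genomic position.
example_ratios = [0.02, 0.31, 0.40, 0.35, -0.05, -0.30, -0.28,
                  0.01, -0.50, -0.45, -0.60]
print(call_segments(example_ratios))  # the two-probe loss at indices 5-6 is not called
```

Passing `threshold=0.2` would mimic the NimbleGen setting described above.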

Criteria for interpreting CNVs

The criteria for interpreting a CNV as unique or common have been described previously (Qiao et al. 2008). Briefly, CNVs overlapping with CNVs reported in at least two studies in the Database of Genomic Variants (DGV) or in our internal controls (Qiao et al. 2008) were considered common CNVs; those that overlapped only partially (<50%) or did not overlap with CNVs reported in the DGV or our internal controls were called unique. Unique CNVs of de novo origin were considered pathogenic and unique CNVs of familial origin were considered putatively pathogenic. All unique CNVs were confirmed by FISH or custom array CGH and their parental origin was determined using the same methods.
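The unique/common distinction can be sketched as an interval-overlap check. This is a simplified illustration, not the procedure of Qiao et al. (2008): the function names, the coordinates and the assumption that `reference_cnvs` has already been restricted to same-chromosome CNVs reported in at least two DGV studies or in internal controls are all hypothetical.

```python
def overlap_fraction(cnv, ref):
    """Fraction of `cnv` covered by `ref`; both are (start, end) tuples
    on the same chromosome."""
    start = max(cnv[0], ref[0])
    end = min(cnv[1], ref[1])
    return max(0, end - start) / (cnv[1] - cnv[0])


def classify_cnv(cnv, reference_cnvs, min_fraction=0.5):
    """Label a CNV 'common' if any reference CNV covers at least
    `min_fraction` of it, otherwise 'unique'."""
    if any(overlap_fraction(cnv, ref) >= min_fraction for ref in reference_cnvs):
        return "common"
    return "unique"


# Hypothetical coordinates (bp): 20% coverage by a reference CNV -> unique.
print(classify_cnv((100_000, 200_000), [(180_000, 400_000)]))
```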

Fluorescence in situ hybridization (FISH)

FISH analyses were performed using the BAC DNA clones from the CNVs, as described previously (Rajcan-Separovic et al. 2007). Slides were viewed on a Zeiss Axioplan 2 fluorescence microscope and images captured using MacProbe software (Applied Imaging, Santa Clara, CA).

Custom oligonucleotide array CGH

For the validation of two maternally inherited abnormalities (duplications of 1p34.1 and 20q13.12), for which FISH testing was not possible, custom arrays were designed using eArray (Agilent Technologies) and analyzed using the ADM-2 algorithm as described previously (Rajcan-Separovic et al. 2010).

Bioinformatics analysis

Web tools for gene prioritization

Five freely accessible gene prioritization web tools were selected: Endeavour (Aerts et al. 2006), GeneWanderer (Kohler et al. 2008), PosMed (Yoshida et al. 2009), Suspects (Adie et al. 2006) and ToppGene (Chen et al. 2009) (Table 1). These five web tools use different data sources and require different inputs. Three of them (Endeavour, GeneWanderer and ToppGene) require a user-defined training-gene set (i.e. known disease genes), while two (PosMed and Suspects) automatically create their own “training”-gene set from a phenotype-specific term entered by the user; in our case, the term was “mental retardation”.

Training-gene sets

For the three web tools that required user-defined training-gene sets, we selected six ID-related disease training-gene sets (ID 1–6): five extracted from the OMIM, Ensembl, DECIPHER, Suspects and Gentrepid databases, and an in-house ID training set which contained genes selected manually from ID-related publications (see details in Supplementary Table 1). In addition, six random training-gene sets (R1–R6) containing genes randomly selected from the whole human genome were used as negative controls (Supplementary Table 1). These genes were selected using a random number generator from the “known genes” listed in the UCSC GoldenPath database (hg18) (http://genome.ucsc.edu/).
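Drawing the random negative-control sets can be sketched as follows. The set size, the seed and the placeholder gene identifiers are assumptions made for illustration; the text does not state how many genes each random set contained or whether the six sets were allowed to share genes.

```python
import random


def random_training_sets(known_genes, set_size, n_sets=6, seed=0):
    """Draw `n_sets` random gene sets of `set_size` genes each, sampling
    without replacement within a set. Different sets may share genes."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [rng.sample(known_genes, set_size) for _ in range(n_sets)]


# Placeholder identifiers standing in for the UCSC "known genes" list.
known_genes = [f"GENE{i:05d}" for i in range(1, 20001)]
random_sets = random_training_sets(known_genes, set_size=50)
print(len(random_sets), len(random_sets[0]))
```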

Overview of computational analysis (Fig. 1)

We first selected seven de novo CNVs (6–97 genes/CNV) for a pilot study to compare the outcome of gene prioritization using three web prioritization tools individually (Endeavour, GeneWanderer and ToppGene) with different training-gene sets. Next, we compared the function of prioritized and non-prioritized genes from 14 de novo CNVs (>15 genes/CNV); the genes were prioritized using a single web prioritization tool or five tools in concert. Finally, the function of the prioritized genes obtained using all five web tools in concert was compared between all de novo, familial and common CNVs.

Fig. 1 Overview of the study

Creation of priority lists and their analysis

(i) 7 pilot CNVs: The priority list for a CNV consisted of the top five genes obtained using three web tools individually (Endeavour, ToppGene and GeneWanderer). As each individual web tool was used with six ID-specific and six random training sets separately, there were 12 web tool-specific priority lists for each tested CNV. In addition, as shown for one CNV from 9qter (Supplementary Table 2), genes from the six priority lists obtained using six ID-specific training sets were ranked and the average rank of each gene was calculated. This “averaged ID priority list” is termed AP1 and contained, for each CNV, the five genes with the highest averaged rank. Similarly, for each CNV the genes from the six priority lists obtained using six random training sets were ranked and the average rank of each gene was used to create the averaged random priority list (AP2). Therefore, each CNV had 14 priority lists for each of the three tools: 12 individual gene priority lists obtained with six ID training sets and six random training sets, as well as AP1 and AP2. The corresponding priority lists were pooled for the seven CNVs per web tool (e.g. AP1 from CNV1 was pooled with AP1 from CNV2 and AP1 from CNV3, etc. for Endeavour). This created 14 web tool-specific pooled priority lists. These web tool-specific pooled priority lists were compared for gene overlap within each of the web tools (e.g. AP2 was compared with the remaining 13 priority lists for Endeavour). Also, the overlap of pooled priority lists between any two web tools was determined (e.g. AP1 obtained with Endeavour was compared with AP1 obtained with ToppGene). Finally, the overlap between AP1 lists and between AP2 lists obtained by the three tools was determined (e.g. AP1 obtained using Endeavour was compared to AP1 obtained using ToppGene and to AP1 obtained using GeneWanderer).

(ii) 14 CNVs: Five web tools in concert and one single web tool (Endeavour) were applied for prioritization of genes from 14 CNVs. For the “five tools in concert” approach, the final priority list for a CNV consisted of genes which were among the top five in the priority lists obtained with at least two prioritization web tools (see an example of prioritization using five tools for the CNV from 9qter in Supplementary Table 2). The pool of prioritized genes for the 14 CNVs was compared in terms of gene function to non-prioritized genes from the same 14 CNVs using Ingenuity pathway analysis (IPA: http://www.ingenuity.com) software. IPA uses known protein–protein and gene–gene interactions in combination with multiple other data sources including differential expression from microarray data.

(iii) All CNVs: IPA software was used to compare the function of prioritized genes between three classes of CNVs as pools (de novo, familial and common) obtained using five web tools in concert with one training set (OMIM).
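The two list-building rules described in (i) and (ii) can be sketched in a few lines: an averaged priority list built from mean ranks across training sets, and a consensus list of genes placed in the top five by at least two tools. The function names, the tie-breaking by gene symbol and the toy gene labels are assumptions made for illustration; the rankings themselves came from the web tools, which are not reproduced here.

```python
from collections import Counter, defaultdict


def averaged_priority_list(rankings, top_n=5):
    """`rankings` is a list of ranked gene lists (best first) for one CNV,
    one per training set. Returns the `top_n` genes with the best
    (lowest) mean rank; ties are broken alphabetically."""
    ranks = defaultdict(list)
    for ranking in rankings:
        for position, gene in enumerate(ranking, start=1):
            ranks[gene].append(position)
    mean_rank = {gene: sum(r) / len(r) for gene, r in ranks.items()}
    return sorted(mean_rank, key=lambda g: (mean_rank[g], g))[:top_n]


def consensus_priority_list(tool_rankings, top_n=5, min_tools=2):
    """`tool_rankings` maps a tool name to its ranked gene list for one CNV.
    Returns genes found in the top `top_n` of at least `min_tools` tools."""
    votes = Counter(gene
                    for ranking in tool_rankings.values()
                    for gene in set(ranking[:top_n]))
    return sorted(gene for gene, n in votes.items() if n >= min_tools)


# Toy rankings with placeholder gene labels.
per_training_set = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
print(averaged_priority_list(per_training_set, top_n=2))

per_tool = {"tool1": ["A", "B", "C", "D"],
            "tool2": ["B", "E", "A", "F"],
            "tool3": ["G", "B", "H", "I"]}
print(consensus_priority_list(per_tool, top_n=3))
```

Prioritizing each CNV separately, as done here, keeps a gene-rich CNV from crowding out genes in smaller CNVs.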

We also used three other publicly available web tools for pathway analysis of prioritized genes as a pool from all classes of CNVs (de novo, familial and common). These included:

(1) WebGestalt (WEB-based GEne SeT AnaLysis Toolkit; http://bioinfo.vanderbilt.edu/webgestalt/) (Zhang et al. 2005);

(2) Pathway-Express (http://vortex.cs.wayne.edu/Projects.html) (Draghici et al. 2003); and

(3) GATHER (http://gather.genome.duke.edu/) (Chang and Nevins 2006).

STRING (Search Tool for Retrieval of Interacting Proteins: http://string.embl.de) was used to identify and compare known and predicted protein–protein interactions based on comparative genomics and text mining (von Mering et al. 2007).

Other public databases

The Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org, version 3.54) (Bult et al. 2008; Eppig et al. 2007) was used to determine if human ID genes have mouse orthologs which show neurodevelopmental phenotypes resulting from the homozygous and/or hemizygous disruption of the gene(s) in knockout mouse models. The mouse phenotypes reviewed for prospective roles in ID etiology involved those affecting the nervous system, embryogenesis, as well as behavioral, neurological and craniofacial phenotypes.

The GeneImprinting website database (http://www.geneimprint.org/site/genes-by-name) was used to determine whether genes prioritized from familial CNVs may be imprinted. Selective parental, monoallelic expression could explain the presence of an abnormal phenotype in an affected child that is absent in the transmitting parent (Lee et al. 2007).

Statistical analysis

All statistical tests were performed using the statistical computation Web site from Vassar College (http://faculty.vassar.edu/lowry/VassarStats.html). The chi-square and Fisher’s exact tests were used.

Results

Array CGH analysis

We identified 47 unique and previously unreported CNVs in 45 out of 255 subjects with ID (17.6%) tested using the three different array platforms (Table 2). Each CNV was confirmed either by FISH or by custom array analysis followed by parental analysis using the same methods: 21 CNVs were de novo (in 20 subjects or 7.8%) and 26 were familial CNVs (in 25 subjects or 9.8%) (Table 2). The detection rate of de novo CNVs was higher (9.4 and 11.1%) with the higher-resolution arrays (Agilent 105K and NimbleGen 385K) than with the low-resolution Spectral Genomics BAC array (6.4%). Deletions were more frequent among de novo (13/21) than familial (1/26) CNVs (p < 0.0001). Six de novo CNVs were found to overlap with known genetic microdeletion/duplication syndromes (Table 3). The CNV sizes ranged from 680 kb to 9.7 Mb for de novo (average 3.0 Mb) and from 31 kb to 1.7 Mb for familial CNVs (average 0.6 Mb). The de novo CNVs contained 1–97 genes/CNV, while the familial CNVs contained 0–21 genes. In total, 595 and 116 non-redundant genes were found to be involved in the 21 de novo and 26 familial CNVs, respectively (Table 3). The 17 most frequently recurring common CNVs (detected in >10% of ID subjects on each of the three array platforms (Qiao et al. 2008)) and overlapping with CNVs from the Database of Genomic Variants (http://projects.tcag.ca/variation/) are also listed in Table 3. They contained a total of 108 genes.
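The deletion comparison above (13/21 de novo vs. 1/26 familial) can be checked with a two-sided Fisher's exact test. The sketch below is a plain re-implementation using only the Python standard library, not the VassarStats calculator used in the study, and the text does not state whether the chi-square or Fisher's exact test produced this particular p value; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-9))


# Rows: de novo, familial CNVs; columns: deletions, duplications.
p_value = fisher_exact_two_sided(13, 8, 1, 25)
print(f"p = {p_value:.2e}")  # falls below the reported bound of 0.0001
```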

Table 2 Summary of array CGH findings
Table 3 List of de novo, familial and most common CNVs and prioritized genes

Candidate gene prioritization in a selection of seven pilot CNVs—impact of training set and gene prioritization web tool on ID candidate gene priority lists

In our preliminary prioritization experiment, we selected seven de novo CNVs of various sizes (0.5–7.9 Mb) and number of integral genes (between 6 and 97) (marked in Table 3) to test the effect of training-gene sets on priority lists obtained with each of the three web tools (Endeavour, GeneWanderer and ToppGene) that require user-defined training sets. Overlap between AP2 (averaged random priority list) pooled for seven CNVs with pooled priority lists obtained using ID training set 1–6 ranged from 46 to 77% for the three tools used (Fig. 2a). It was even higher when the individual priority lists, obtained using six random training sets, each pooled for the seven CNVs, were compared with AP2 lists pooled for seven CNVs (83–97%). The rate of overlap between the priority lists within the web tool depended on the number of genes/CNV as CNVs with more genes (>15 genes/CNV) showed less overlap between the averaged priority lists AP1 and AP2 (65–75%) compared to smaller CNVs with <15 genes/CNV (87–100%, Table 4). The >50% overlap between the priority lists obtained using random and ID training-gene sets for the seven CNVs suggests limited disease specificity for the tools regardless of the training-gene set.

Fig. 2 Overlap of gene priority lists. a The percentage of overlap presented for each web tool separately and obtained by comparing the averaged random priority lists (AP2) pooled for seven CNVs to each of the 13 priority lists pooled for seven CNVs. These 13 lists were based on using ID training sets 1–6, random training sets 1–6 and the averaged ID priority list (AP1). b Overlap of averaged ID priority list (AP1) or averaged random priority list (AP2) (pooled for seven CNVs) between any two out of the three web tools

Table 4 Rate of overlap between the averaged priority lists (AP1 and AP2) from seven de novo CNVs

The results of the prioritization were further investigated by assessing overlap between prioritized genes in AP1 and AP2 lists (each pooled for seven CNVs) among the three tools. A low number of genes in common was noted when any two of the three tools were compared and ranged between 26 and 45% (Fig. 2b). Similarly, there were few genes common to all three tools within pooled AP1 and AP2 list (5/76 and 6/72) (Fig. 3). This indicated a significant discrepancy in prioritization results among the different tools. The majority of these common genes (4) was the same between AP1 and AP2 priority lists (Fig. 3). Most of these common genes (3/4) were well-investigated disease-related genes (CRHR1 (Varela et al. 2006), MAPT (Koolen et al. 2008), MBD5 (Jaillard et al. 2009)), suggesting that both the random and the ID-specific training sets tend to prioritize known disease genes.

Fig. 3 The number of prioritized genes overlapping between three web tools. a The number of genes from AP1 lists pooled for seven CNVs detected with each of the three web tools and their overlaps. The five genes detected by all web tools are: CRHR1, EHMT1, EPC2, MAPT and MBD5. b The number of genes from AP2 lists pooled for seven CNVs detected with each of the three web tools and their overlaps. The six genes detected by all web tools are: ACBD5, CRHR1, EPC2, MAPT, MBD5 and MPP7
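The overlap between the two consensus gene sets named in the Fig. 3 legend can be reproduced with simple set operations. The percentage definition used in `percent_overlap` (shared genes as a share of the first list) is an assumption; the text does not specify which denominator was used for the overlap rates.

```python
# Genes detected by all three web tools, as listed in the Fig. 3 legend.
ap1_all_tools = {"CRHR1", "EHMT1", "EPC2", "MAPT", "MBD5"}
ap2_all_tools = {"ACBD5", "CRHR1", "EPC2", "MAPT", "MBD5", "MPP7"}


def percent_overlap(list_a, list_b):
    """Shared genes as a percentage of the genes in `list_a`."""
    set_a, set_b = set(list_a), set(list_b)
    return 100 * len(set_a & set_b) / len(set_a)


shared = sorted(ap1_all_tools & ap2_all_tools)
print(shared)  # the four genes common to the AP1 and AP2 lists
print(percent_overlap(ap1_all_tools, ap2_all_tools))
```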

Candidate gene prioritization in a selection of 14 CNVs—comparison of function of prioritized and non-prioritized genes

Based on the observation that only 26–45% of genes overlap for any two tools that require training sets (Fig. 2b), we prioritized genes from 14 larger CNVs (>15 genes) using five tools in concert and compared them with gene priority lists obtained with one individual tool as well as with non-prioritized genes. We selected Endeavour as a single tool, because it captures the largest number of data sources and utilizes more flexible input forms with validated in vivo experiments on a prioritized gene (Tranchevent et al. 2008). Prioritized gene lists were obtained using only one of the ID-specific training sets (OMIM) and one random training set (R1). The OMIM training set was selected because it can be automatically downloaded from the frequently cited and updated Web site with the option of selecting genes with known sequence and phenotypes (Supplementary Table 1).

The outcome of the prioritization was evaluated by analysis of the function of the prioritized genes using IPA software. The gene function comparison was made between the initial non-prioritized 491 genes derived from the 14 CNV loci to pooled prioritized genes from these CNVs obtained with the above described approaches (Endeavour vs. five gene prioritization web tools in concert; OMIM vs. random set). Our results show that genes in the IPA category with biological function were significantly more prevalent after prioritization using Endeavour with either OMIM disease training-gene set (89%) or R1 random training-gene set (78%) compared to non-prioritized genes (53% had biological function) (both p < 0.001; Table 5). Similarly, enrichment for genes in the IPA category “with nervous system development and function” was noted both with OMIM and with random training set (19 and 13%, respectively) compared to non-prioritized genes (7%); however, it was significant only after prioritization using Endeavour with the OMIM training-gene set (p = 0.004), and not with random training-gene set R1 (p = 0.1).

Table 5 IPA analysis of the function of prioritized genes from a selection of 14 de novo CNVs

When the results of prioritization using five tools were combined as described previously (with OMIM disease training-gene set), the prevalence of genes with biological/nervous system function after prioritization was even higher, as 93 and 31% of prioritized genes had biological and nervous system functions, respectively, in comparison to 53 and 7% of non-prioritized genes (p < 0.0001; Table 5).

Candidate gene prioritization from all de novo, familial and most common CNVs including analysis of the function of prioritized genes

Based on these results, we used the same five tools in combination method and the OMIM disease training-gene set to prioritize the genes in each of our remaining CNVs. For CNVs with ≤5 genes, we “prioritized” all of them as ID candidate genes. In summary, 102 out of 595, 71 out of 116 and 51 out of 108 genes represented candidate ID genes from 21 de novo, 27 familial and 17 most common CNVs after prioritization (Tables 3, 6).

Table 6 IPA analysis of the function of prioritized genes from different CNV group

We used a number of databases and bioinformatics web tools to compare the function of genes prioritized from the different CNV groups. First, we applied the MGI database to determine if loss of function of any of the prioritized genes (due to either homozygous or hemizygous disruption in mouse models) caused mouse phenotypes. We found that 37 out of 66 (56%) prioritized genes from 13 de novo deletion CNVs had been investigated in knockout mouse models on MGI. Among them, 70% (26/37) were found to have annotations related to mouse knockout phenotypes including nervous system abnormalities, abnormalities during embryogenesis, behavior/neurological phenotypes, and/or craniofacial phenotypes (Table 3), suggesting an enrichment of genes with these functions in the priority list obtained for the de novo deletion CNVs. It is interesting that only ~25% of all human genes currently have mouse ortholog knockout data and 70% of our priority genes were among them. We could not assess the presence of prioritized genes from our familial and common CNVs in MGI phenotypes, as they were predominantly involved in duplications.

Next, we used IPA software to compare the enrichment of genes prioritized from different CNV groups. We found a significantly higher proportion of genes with biological and nervous system function in the de novo CNV subgroup compared to the familial or most common CNVs (Table 6; p < 0.05). The genes from familial CNVs identified to have a role in nervous system function were KCNE1 (Letts et al. 2000), CD9 (Doh-ura et al. 2000) and REG1B (Tebar et al. 2008). We also searched for evidence of imprinting for familial prioritized genes based on information from the GeneImprinting database (http://www.geneimprint.org/site/genes-by-name). Only one gene, GATM, is imprinted in mouse; its imprinting status in humans is unknown. By searching gene ontology terms, four genes (DPF3, MYST4, SETMAR and SMYD3) were found to have chromatin modification functions in the familial CNV group, while three genes (ARID1A, EHMT1 and EPC2) with chromatin modification function were found in the de novo CNV group. No gene with imprinting or chromatin-related functions was found in the prioritized group from the most common CNVs.

To analyze the pathways involving the prioritized genes, we applied IPA software and three public web-pathway tools (WebGestalt, Pathway-Express and GATHER) (Supplementary Table 3). Significant enrichment of prioritized genes from de novo CNVs in the neuroactive ligand–receptor interaction and MAPK signaling pathways was detected by two pathway tools (WebGestalt and Pathway-Express). The prioritized genes from our de novo CNVs involved in these two pathways are listed in Table 7. We also tested the six ID training-gene sets for participation in pathways and, in addition to the two pathways listed above, enrichment was detected in regulation of actin cytoskeleton development and axonal guidance by more than one web tool. No enrichment for any specific pathway was found among the random training-gene sets. Genes in the familial and most common CNVs were found to be more involved in pathways related to carbohydrate metabolism and immune responses (Supplementary Table 3).

Table 7 The two pathways enriched for genes prioritized from our de novo CNVs

Finally, we analyzed the network of interactions of prioritized genes from each of the three CNV groups and the OMIM training set using STRING (Fig. 4). We noted that the genes from the OMIM training set and de novo CNVs are more likely to have connections compared to familial and most common CNVs. There were 8/61 (13%) isolated genes (i.e. without any connection to other genes) in the OMIM set, 33/102 (32%) in the de novo, 44/71 (62%) in the familial CNVs and 26/51 (51%) in the common CNV set.
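The notion of an isolated gene (one with no interaction partner within the same gene set) can be made concrete with a short sketch. The function name and the toy network are hypothetical and the STRING query itself is not reproduced; only the counts in `reported_counts` are taken from the text above.

```python
def isolated_genes(genes, interactions):
    """Return genes that have no interaction partner within `genes`.
    `interactions` is an iterable of (gene_a, gene_b) pairs."""
    gene_set = set(genes)
    connected = set()
    for a, b in interactions:
        # Ignore self-loops and partners outside the gene set.
        if a != b and a in gene_set and b in gene_set:
            connected.update((a, b))
    return sorted(gene_set - connected)


# Toy network with placeholder gene labels; X9 is outside the gene set.
print(isolated_genes(["G1", "G2", "G3", "G4"], [("G1", "G2"), ("G3", "X9")]))

# Reported (isolated, total) counts per gene set, converted to percentages.
reported_counts = {"OMIM": (8, 61), "de novo": (33, 102),
                   "familial": (44, 71), "common": (26, 51)}
percent_isolated = {name: round(100 * k / n)
                    for name, (k, n) in reported_counts.items()}
print(percent_isolated)
```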

Fig. 4 Interactions of prioritized genes in different CNV groups. a Prioritized genes from de novo CNVs. b Prioritized genes from familial CNVs. c Prioritized genes from most common CNVs. d All genes from OMIM training set

Discussion

We used three array CGH platforms to screen for unique CNVs in 255 subjects with ID. Our total detection rate of de novo CNVs (7.8%) is consistent with previously reported results (5.7–11%) (Koolen et al. 2009), and increased to 9.6% as the resolution of array platforms increased. As expected, the higher-resolution arrays detected smaller CNVs (the smallest being 31 kb). These small CNVs appear to be mostly familial in origin (for CNVs < 500 kb in size, 15/27 were from familial CNVs vs. 2/21 from de novo CNVs, Table 3), while the de novo CNVs tended to be larger (on average, 3.1 Mb for de novo vs. 0.6 Mb for familial CNVs) and contain a larger number of genes (30 genes per de novo CNV vs. 5 genes per familial CNV, on average, Table 2). Two exceptions are the de novo deletions of 2p13 (780 Kb) and 10q (1.6 Mb), which harbored two (CYP26B1 and EXOC6B) and one (ZWINT) gene, respectively. Disruption of CYP26B1 has been reported to affect neural crest formation and the knockout mouse showed abnormal craniofacial morphology (Maclean et al. 2009). EXOC6B is involved in brain exocytosis (Brymora et al. 2001), and ZWINT is highly expressed in the brain, localized extensively in primary hippocampal neurons (van Vlijmen et al. 2008). The remaining de novo CNVs contain at least six genes/CNV.

By using five gene prioritization tools, we prioritized the ID candidate genes from each unique de novo, familial and most common CNVs. We identified the candidate ID genes from each CNV separately in order to avoid the over-representation of genes from larger chromosomal regions with more genes and the under-representation of genes from small CNV loci with few genes, which could occur if prioritization of all genes is performed as a pool.

One of the challenges in using prioritization tools is the discrepancy in candidate gene selection resulting from using different web tools (Teber et al. 2009; Thornblad et al. 2007). For example, two studies by Tiffin et al. (2006) and Teber et al. (2009) used the same initial group of 9,556 un-prioritized positional candidate genes and slightly different web tools (six out of eight web tools were the same for the two studies) to predict candidate genes for type 2 diabetes (T2D) and the related trait obesity. They found that the number of predicted candidate genes obtained by each of the different web tools varied dramatically and no match was found between the candidate lists of the two studies. Thornblad et al. (2007) tested three web tools (PosMed, GeneSniffer and SUSPECTS) on four disorders (breast cancer, Crohn’s disease, age-related macular degeneration and schizophrenia) in which 10, 20 and 30 Mb segments of the chromosome containing the known susceptibility loci were tested. They found that the known disease gene(s) were not always in the top ranking list and were more likely to rank higher when selected from a narrower genomic region. A combination of multiple candidate gene prioritization web tools was recommended and applied in most of the recent studies (Huang et al. 2008; Teber et al. 2009; Thornblad et al. 2007; Tiffin et al. 2006). Our data also showed that the priority lists obtained using our five web tools in concert were more enriched in genes with biological and nervous system function than the lists obtained using the single web tool Endeavour (93% vs. 89% for genes with biological functions, p = 0.6; 31% vs. 19% for genes with nervous system functions, p = 0.21; Table 5). However, as the difference in enrichment between the priority lists obtained using Endeavour and five tools in concert was not statistically significant, the use of a single tool such as Endeavour may still be considered; it is more practical, because creating a prioritization list using five web tools in concert is time consuming.

The effect of the training set on the priority list has not yet been investigated, nor has the difference between the outcome of prioritization using random and disease-specific training sets been addressed. It was surprising that the overlap between the prioritized gene lists obtained using random and ID training sets was very high (above ~60%; Table 4; Fig. 2). Two web tools (PosMed and Suspects), which use a disease term to create the “training” set internally, also gave a high overlap between prioritized gene lists (39–60%) when the terms “mental retardation”, “diabetes” or “breast cancer” were used to prioritize four larger CNVs (all containing >15 genes/CNV) (data not shown). Despite the observed disease “non-specificity” of the web tools and training sets, our data show that candidate gene prioritization web tools in combination with the OMIM training-gene set significantly enrich for genes with brain function, in comparison to non-prioritized genes, which could be useful for pre-selecting genes for further analysis from larger CNVs.

The aim of our tool comparison was not to identify the “best” prioritization tool, but to explore and apply them in light of the growing number of array-detected pathogenic CNVs, most of which have too many genes to allow for extensive literature searches individually. Currently, many of the available tools use different sources of information and algorithms for comparison with training genes and rely heavily on functional annotation of candidate genes. They are therefore inherently biased toward better characterized genes, which may result in less disease specificity (J. Gillis and P. Pavlidis, submitted). A user-friendly web tool which combines many datasets for identification of the most likely candidate ID genes is highly desirable in the current transition from conventional cytogenetics to ID-gene identification, as it can assist not only in exploring candidate disease genes for further molecular and functional analysis, but also in understanding the pathogenesis of ID.

Intellectual disability (ID) is a complex condition involving functional, developmental and structural alterations in the brain and/or nervous system. Although hundreds of genes have been implicated in ID, the pathophysiological mechanisms underlying it remain largely unknown. In our study, we compared the function of a series of prioritized candidate genes in de novo, familial and most common CNVs found in our ID cohort. A significant proportion (~40%, 26/66) of the prioritized genes in de novo deletion CNVs are associated with knockout mouse phenotypes manifesting abnormalities in nervous system development and are therefore likely involved in brain-related functions. This is consistent with a recently published report by Webber et al. (2009) in which a collection of 148 ID-associated de novo CNVs from the literature and >26,000 benign CNVs from different sources (both encompassing >4,000 genes) were assessed using MGI. ID-related CNVs (excluding benign CNVs) were found to be significantly enriched in genes with nervous system phenotypes when disrupted in mice. The MGI web tool, which catalogs mouse knockout phenotype data, is therefore a very informative resource for the study of disease genes in humans.

For pathway and bioinformatics-based analyses of gene functions, we applied several different tools to minimize the bias from any single method. Different pathway web tools produced different outputs, although most of them integrate the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway dataset (http://www.genome.jp/kegg/pathway.html). We noted that the prioritized genes from de novo CNVs are more involved in the neuroactive ligand–receptor interaction and MAPK signaling pathways than expected by chance according to both WebGestalt and Pathway-Express. The involvement of MAPK pathways in ID has been observed recently (Sweatt 2001; Aoki et al. 2008), and 4/6 ID-related disease training-gene sets were also found to be enriched for genes from this pathway based on the Pathway-Express, WebGestalt and Gather web tool analyses. Among 10 genes from our de novo CNVs involved in these two pathways, five have already been reported in ID-related studies (RPS6KA1 (Zeniou et al. 2002), MEF2C (Lipton et al. 2009), CACNA1B (Ladera et al. 2009), MAPT (Koolen et al. 2008) and CRHR1 (Varela et al. 2006)). The remaining five genes are promising ID candidates, with four demonstrating knockout mouse phenotypes related to nervous system abnormalities on MGI (S1PR4 (Meng and Lee 2009), OPRL1 (Manabe et al. 1998), RASA1 (Henkemeyer et al. 1995) and VIPR2 (Harmar et al. 2002)). The fifth gene, UTS2R, encodes a receptor for Urotensin-II, which was found in a recent study to act as a neurotransmitter regulating various neurobiological activities, including anxiety and depression (do Rego et al. 2008). Other prioritized genes from our de novo CNVs that have already been found to be involved in ID include CHRNA4 (Elghezal et al. 2007), EHMT1 (Kleefstra et al. 2006), FEZ1 (Lee et al. 2005), KCNQ2 (Borgatti et al. 2004), MBD5 (Jaillard et al. 2009), MEN1 (Nakajima et al. 1999), SOX8 (Pfeifer et al. 2000), OTX1 (Laroche et al. 2008) and SF1 (Schlaubitz et al. 2007).
This suggests that, although further improvements for disease specificity are necessary, the candidate gene prioritization tools will remain a promising avenue for narrowing down functional genes harbored within pathogenic CNVs.
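As an illustration of what pathway involvement “more than expected by chance” can mean, the sketch below applies a one-sided hypergeometric test, a common approach to over-representation analysis. It is a generic example and is not claimed to reproduce the algorithms implemented in WebGestalt, Pathway-Express or Gather; the function name `hypergeom_sf` and all counts are hypothetical and are not data from this study.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """Return P(X >= k) for a hypergeometric variable X.

    population: number of background genes
    successes:  background genes annotated to the pathway
    draws:      number of prioritized genes
    k:          prioritized genes annotated to the pathway
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts chosen only to keep the arithmetic easy to follow.
background_genes = 20   # genes in the background set
pathway_genes = 5       # background genes annotated to the pathway
prioritized_genes = 4   # genes in the prioritized list
pathway_hits = 2        # prioritized genes annotated to the pathway

p_value = hypergeom_sf(pathway_hits, background_genes, pathway_genes, prioritized_genes)
print(f"P(at least {pathway_hits} pathway genes by chance) = {p_value:.4f}")
```

A small p-value indicates that the prioritized list contains more genes from the pathway than random sampling from the background would typically produce. In practice the choice of background set strongly affects the result, and testing many pathways requires correction for multiple comparisons.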

In conclusion, we applied array CGH and bioinformatics approaches to explore ID-related genes. Our results show that high-resolution array analysis with a combination of different computational approaches is helpful in extracting ID candidate genes and associating them with functional networks involved in ID. We believe that the opportunities to identify and prioritize the most likely candidate genes will facilitate their further molecular analysis and delineate their role in the pathogenesis of ID.