Primary immunodeficiency diseases (PIDs) comprise a group of highly heterogeneous genetic disorders caused by defects in the immune system and can be categorized as lymphocyte deficiencies (T and/or B cell and NK cell defects), phagocytic defects, complement deficiencies and innate immunodeficiencies [1]. The prevalence of each individual PID varies in the population [2], ranging from 1/600 for selective IgA deficiency [3] to approximately 1/250,000 for chronic granulomatous disease [4]. However, the prevalence of various forms of PID is probably underestimated [5]. There are currently close to 300 forms of PID as defined by the International Union of Immunological Societies (IUIS) [6, 7]. However, around 3110 genes may be potentially PID-causing based on their biological functions and human gene connectome analysis [8], suggesting the existence of a large number of hitherto unrecognized diseases. Some of these diseases are monogenic, but a growing proportion are caused by a digenic or polygenic mechanism [9, 10].

Since the first methodology paper on next generation sequencing (NGS) was published [11], this technology has rapidly advanced the discovery of genetic variants underpinning human Mendelian disorders (MDs). Rapid advances in NGS facilitate, at ever decreasing costs, the processing and analysis of genomic regions ranging from targeted gene sets to high coverage of whole human genomes, and support linkage analysis, homozygosity mapping and candidate gene approaches. Whole exome sequencing (WES) and whole genome sequencing (WGS) have become increasingly widespread approaches for the identification of MD-associated genes, by sequencing trio or quartet [12] families, several individuals in a pedigree [13], a small patient cohort [14], or even the proband alone [15].

Gene discovery and understanding of the molecular basis of disease are essential starting points for making a molecular diagnosis and providing genetic counseling [16, 17] and may provide clues to the development of new therapeutic approaches for different diseases. In the past several years, the molecular basis of several forms of PID and their clinical consequences has been well documented [18–25]. However, despite intensive work during the last decades, many forms of PID still do not have a defined underlying genetic defect and further exploration is needed to push this frontier forward. There are several publicly accessible tools or bioinformatic analysis pipelines available, which house numerous NGS analysis scripts for data quality control (QC), as well as tools for alignment, variant-calling, and annotation. However, a comprehensive pipeline designed specifically for PID candidate gene screening has not yet been developed. Herein we summarize practical considerations for the interpretation of the NGS data from PID patients and provide an overview of data and findings from our own PID cohort.

Pipeline for Screening the Candidate Gene

Next Generation Sequencing is a promising strategy for the study of human monogenic disorders. However, pinpointing the causal mutations in a small number of samples is still a major challenge. A large number of variants can be detected by WES and WGS and approximately 100,000 SNVs and 15,000 Indels (Fig. 1) have been identified per sample using the Agilent SureSelect Human All Exon V4 kit in our PID cohort. They include polymorphisms, sequencing artifacts and non-pathogenic rare mutations, and finding the causative gene remains a challenge (Fig. 1).

Fig. 1

Pipeline of NGS-based gene identification in PID. The bar chart indicates the number of variants remaining after each prioritization step

Several steps may help to reduce the number of candidate genes. Firstly, we can prioritize variants according to their impact on the sequence of the protein product. The majority of disease-causing variants are believed to change the amino acid sequence; they include non-synonymous substitutions, splice site mutations and insertions/deletions, as well as truncation of proteins due to a premature stop codon. Because the available annotation databases are incomplete, a disease-causing mutation may be missed when only one of them is used. We therefore recommend annotating the variants with multiple databases [26] and selecting both the canonical annotation and the one with the most severe consequence. However, this method of annotation will not identify large deletions, even in a homozygous form; therefore, additional bioinformatics tools that use cross-sample comparisons, such as Exondel [27] (for homozygous large deletions) and FishingCNV [28] (for homozygous as well as heterozygous large deletions), are required in a separate analysis step.
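The consequence-based prioritization described above can be sketched as follows. This is a minimal illustration, not the pipeline used in our study: the consequence labels, their severity order and the database names (`db_a`, `db_b`) are assumptions made for the example and do not correspond to the vocabulary of any particular annotation tool.

```python
# Assumed severity order, most severe first. The labels are illustrative
# placeholders, not the vocabulary of any specific annotation tool.
SEVERITY = [
    "stop_gained", "frameshift", "splice_site",
    "inframe_indel", "missense", "synonymous", "intronic",
]
# The first five labels are treated as amino-acid-sequence-changing.
PROTEIN_ALTERING = set(SEVERITY[:5])


def most_severe(consequences):
    """Return the most severe label among the given consequence labels."""
    return min(consequences, key=SEVERITY.index)


def is_protein_altering(annotations):
    """annotations maps a database name to the consequence it reports."""
    return most_severe(annotations.values()) in PROTEIN_ALTERING


# Hypothetical variant annotated against two gene-model databases.
variant = {"db_a": "synonymous", "db_b": "splice_site"}
print(most_severe(variant.values()))   # splice_site
print(is_protein_altering(variant))    # True
```

Taking the most severe consequence across databases keeps a variant whenever at least one gene model predicts a protein change, which favors sensitivity over specificity at this early step.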

Secondly, as most PIDs are relatively uncommon and are presumed to be caused by rare mutations, variants at polymorphic sites can be excluded by removing those that have a frequency of more than 1 % in public or in-house databases. The available public databases include the 1000 Genomes Project (KG) (2500 samples; http://www.1000genomes.org/), the Exome Variant Server (ESP) (6500 WES samples; http://evs.gs.washington.edu/EVS/) and the Exome Aggregation Consortium (ExAC) (60,706 samples; http://exac.broadinstitute.org/). It should be noted that the ESP and ExAC databases contain a mix of healthy individuals and patients, and should therefore be used with care when they may include patients affected by the disease under study. In-house databases are often helpful to exclude platform-specific false positives or background noise. Analysis of the frequency of PID-related variants downloaded from public databases, including 5206 PID causative variants (substitutions or Indels selected from 7561 records) from the Resource of Asian Primary Immunodeficiency Diseases (RAPID) database (http://web16.kazusa.or.jp/rapid/) [29] and 4203 variants in PID genes [6] from The Human Gene Mutation Database (HGMD) (http://www.hgmd.cf.ac.uk/ac/index.php), shows that most causative PID variants have an allele frequency of less than 1 % (Fig. 2a, b). This provides the rationale for excluding polymorphisms during candidate mutation prioritization.
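A minimal sketch of the frequency filter described above is given below. The variant names and frequencies are invented for illustration, and treating a variant that is absent from a database (`None`, the 'NA' case) as passing that database is an assumption of this sketch.

```python
MAF_CUTOFF = 0.01  # the 1 % threshold discussed in the text


def is_rare(freqs, cutoff=MAF_CUTOFF):
    """Return True if no database reports a frequency above the cutoff.

    freqs maps a database name to an allele frequency, or to None when
    the variant is absent from that database.
    """
    observed = [f for f in freqs.values() if f is not None]
    return all(f <= cutoff for f in observed)


# Hypothetical variants with made-up frequencies.
variants = {
    "var_1": {"KG": 0.12, "ESP": 0.10, "ExAC": 0.11, "in_house": 0.09},
    "var_2": {"KG": None, "ESP": 0.0002, "ExAC": 0.0001, "in_house": None},
    "var_3": {"KG": None, "ESP": None, "ExAC": None, "in_house": None},
}
rare = [name for name, freqs in variants.items() if is_rare(freqs)]
print(rare)  # ['var_2', 'var_3']
```

Requiring every database to be at or below the cutoff is the stricter choice; a population-specific cutoff or a per-database cutoff could be substituted where a disease is known to be relatively frequent in one ethnic group.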

Fig. 2

Frequency and predicted functional impact of the known PID variants. (a) (b) Number of PID variants in different frequency ranges, for variants downloaded from the RAPID (http://web16.kazusa.or.jp/rapid/) database and the HGMD (http://www.hgmd.cf.ac.uk/ac/index.php) database, respectively. The frequency data are based on our in-house database and the ExAC (http://exac.broadinstitute.org/), ESP (http://evs.gs.washington.edu/EVS/) and KG (http://www.1000genomes.org/) databases; ‘NA’ represents variants not present in the corresponding database. (c) (d) Number of variants falling into different categories of functional impact, as predicted by different software tools; ‘NA’ represents variants whose functional impact could not be predicted by the software

Examining whether the zygosity of a mutation is consistent with the inheritance model of the disease is another important step in identifying causative variants. It is rational to examine known genes for biallelic variants under a recessive model, hemizygous variants in male patients under an X-linked model, and heterozygous variants under a dominant model. The 2015 IUIS classification update shows that approximately 70 % of PIDs are inherited as autosomal recessive, 20 % as autosomal dominant and 5 % as X-linked disorders [6], and that the proportion of newly described autosomal dominant etiologies is increasing faster than that of other types. It should be noted, however, that sporadic PID cases are not uncommon [30]; these cases could be due to de novo dominant heterozygous mutations. For families with only one or a few affected members, we assume that the penetrance of the disease is high. It is therefore reasonable to assume that unaffected members do not carry a disease-predisposing or disease-causing genotype that is consistent with the inheritance model. More elaborate algorithms based on genetic rules can help to narrow down the candidate gene list. For instance, WES data enable haplotype analysis based on identity by descent (IBD) information, which can be used to pinpoint chromosomal segments that may harbor disease-causing mutations. Homozygosity mapping (using HomSI, AutoSNPa, AgileVariantMapper and similar tools [21, 31]) facilitates disease gene identification in consanguineous families, as the causative variants are likely to be located in homozygous regions. In addition, software tools such as pVAAST [32] and VarScan trio family calling [33] can be used to identify genetic variants that directly influence disease risk in both consanguineous and non-consanguineous family data. VarScan trio supports single-family analysis, while pVAAST provides probabilistic predictions of disease-risk variants from data on one or multiple families.
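The zygosity checks described above can be illustrated for a single trio with unaffected parents. This is a simplified sketch under stated assumptions: genotypes are encoded as the number of alternate alleles (0, 1 or 2), a hemizygous call on the X chromosome of a male is encoded as 1, penetrance is assumed to be complete, and the model names are hypothetical. Compound heterozygosity, which requires a gene-level comparison of two variants, is not covered.

```python
def fits_model(proband, father, mother, model):
    """Check one variant in a trio against an inheritance model.

    Genotypes are alternate-allele counts: 0, 1 or 2. Both parents are
    assumed to be unaffected and penetrance is assumed to be complete.
    """
    if model == "recessive_homozygous":
        # Affected child is homozygous; both parents are carriers.
        return proband == 2 and father == 1 and mother == 1
    if model == "de_novo_dominant":
        # Heterozygous in the child, absent from both parents.
        return proband == 1 and father == 0 and mother == 0
    if model == "x_linked_male":
        # Hemizygous male proband (encoded as 1) with a carrier mother;
        # a de novo X-linked variant would need a separate rule.
        return proband == 1 and father == 0 and mother == 1
    raise ValueError(f"unknown model: {model}")


# Hypothetical trio genotypes (proband, father, mother).
print(fits_model(2, 1, 1, "recessive_homozygous"))  # True
print(fits_model(1, 0, 1, "de_novo_dominant"))      # False
```

A variant that is heterozygous in an unaffected parent fails the de novo dominant rule, which reflects the high-penetrance assumption stated in the text.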

Most MD-causing mutations are thought to be functionally critical and located at sites that are evolutionarily conserved. Several software tools have been developed on the basis of this assumption and may provide helpful information on the severity of the impact of a variant. These tools include CADD, SIFT, PolyPhen2, GERP, GWAVA and MutationTaster, all of which predict the biological effect of variants. These tools are powerful, but they do not categorize all variants correctly and are sometimes inconsistent with each other. Both SIFT and PolyPhen2 have difficulty in predicting the effect of stop-gain variants. Prediction of the severity and functional impact of known PID-causing variants shows that the majority are predicted to be damaging, but some are predicted to be benign while others cannot be evaluated (Fig. 2c, d). Known PID genes can also be used to train prediction models that prioritize candidate genes according to the connection between the function, pathway and expression information of the candidate genes and the phenotype. Tools of this kind include VAAST, eXtasy, PHEVOR, Phen-Gen, Phenolyzer, Phenomizer, GeneCards and PosMed [8, 34–36].
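Because the predictors can disagree or fail to evaluate a variant, it can help to tabulate their calls side by side rather than rely on a single tool. The sketch below assumes that each tool's output has already been reduced to one of the labels "damaging" or "benign", or to `None` when no prediction is available; this reduction, and the example calls, are assumptions made for illustration and are not real outputs of the tools named.

```python
def summarize_predictions(calls):
    """Summarize per-tool calls for one variant.

    calls maps a tool name to "damaging", "benign" or None (not evaluated).
    """
    evaluated = {tool: call for tool, call in calls.items() if call is not None}
    damaging = sorted(tool for tool, call in evaluated.items() if call == "damaging")
    return {
        "n_evaluated": len(evaluated),
        "n_damaging": len(damaging),
        "damaging_tools": damaging,
        # True when every tool that produced a call agrees.
        "concordant": len(set(evaluated.values())) <= 1,
    }


# Hypothetical calls for a stop-gain variant that two tools cannot score.
calls = {"SIFT": None, "PolyPhen2": None, "CADD": "damaging", "MutationTaster": "damaging"}
summary = summarize_predictions(calls)
print(summary["n_evaluated"], summary["n_damaging"], summary["concordant"])  # 2 2 True
```

Keeping unevaluated variants visible, instead of silently counting them as benign, avoids discarding variant classes that some tools cannot score.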

Some genes are prone to enrich phenotype-irrelevant mutations due to their long coding sequences [37] and evolutionary pressure [38]. It should be noted that these genes are the source of many false positive results. Fortunately, there are some tools or databases available to reduce noise, including Gene Damage Index (GDI) [38], a tool to indicate the degree of mutation enrichment in diseases under different inheritance models. Another available tool is Frequently Mutated Genes (FLAGS) [37], which provides a list of genes that are enriched for neutral mutations. For a given sequencing platform, a database of genes that are enriched in neutral mutations could provide helpful information in order to reduce platform-specific false variants.
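One way to use such a gene list is to set aside, rather than discard, candidates that fall in genes known to accumulate neutral mutations. The sketch below is illustrative only: the gene names are placeholders, and in practice the list would be taken from FLAGS, GDI or a platform-specific in-house database.

```python
def split_by_gene_flag(candidates, noisy_genes):
    """Separate candidates in frequently mutated genes from the rest.

    candidates is a list of dicts with at least a "gene" key;
    noisy_genes is a set of gene symbols enriched for neutral mutations.
    """
    kept = [c for c in candidates if c["gene"] not in noisy_genes]
    flagged = [c for c in candidates if c["gene"] in noisy_genes]
    return kept, flagged


# Placeholder gene symbols, not taken from any published list.
noisy_genes = {"GENE_LONG_1", "GENE_LONG_2"}
candidates = [
    {"gene": "GENE_LONG_1", "variant": "var_1"},
    {"gene": "GENE_X", "variant": "var_2"},
]
kept, flagged = split_by_gene_flag(candidates, noisy_genes)
print([c["variant"] for c in kept])     # ['var_2']
print([c["variant"] for c in flagged])  # ['var_1']
```

Flagged candidates remain available for review, which matters when a frequently mutated gene is also a plausible disease gene.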

Homozygous deletions of internal exons can be investigated using Exondel [27], a tool that uses the exon as the unit for examining deletions at the gene level. By comparison with multiple controls and in-house databases, the pipeline is able to identify disease-associated large exon deletions within a gene region.

In our cohort study, sequencing reads were aligned to the human reference genome (NCBI build 37.1, hg19) using both SOAPaligner and BWA with default parameters. Approximately 1,000,000 SNVs and 100,000 Indels were detected in the cohort using SOAPsnp and GATK. After several filtering steps, such as removing polymorphic sites with a frequency of more than 1 % in public databases and keeping only amino-acid-changing variants, there were still some mutations that were found in a majority of the sequenced samples and were predicted to be highly deleterious. These comprised 61 SNVs and 217 Indels that were carried by more than 50 % of the patient samples (206 patients) and 50 % of our PID controls (27 normal individuals from the patients’ families) in our cohort (data not shown), suggestive of platform-associated errors.
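The recurrence check described above, in which variants carried by more than half of both the patients and the controls are treated as suspected platform artifacts, can be sketched as follows. The sample and variant identifiers are invented, and the function names are hypothetical rather than part of our pipeline.

```python
from collections import Counter


def recurrent_variants(carriers_by_sample, min_fraction=0.5):
    """Return variants carried by more than min_fraction of the samples.

    carriers_by_sample maps a sample ID to the set of variants it carries.
    """
    n_samples = len(carriers_by_sample)
    if n_samples == 0:
        return set()
    counts = Counter(v for variants in carriers_by_sample.values() for v in variants)
    return {v for v, c in counts.items() if c / n_samples > min_fraction}


def likely_artifacts(patients, controls, min_fraction=0.5):
    """Variants that are recurrent in both the patient and control groups."""
    return recurrent_variants(patients, min_fraction) & recurrent_variants(controls, min_fraction)


# Hypothetical carriers: var_a is recurrent in both groups.
patients = {"P1": {"var_a", "var_b"}, "P2": {"var_a"}, "P3": {"var_a", "var_c"}}
controls = {"C1": {"var_a"}, "C2": {"var_a", "var_c"}}
print(likely_artifacts(patients, controls))  # {'var_a'}
```

Requiring recurrence in unaffected controls as well as in patients reduces the risk of discarding a genuine founder mutation that is common among the patients only.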

Here we summarize some of the variants that were found among our patients, including variants reported in our previous studies (Table 1), hypomorphic variants causing an incomplete phenotype (F3, F9, F12, F15, F16), a new or unexpected clinical presentation in known PID genes (F10, F19), variants in known genes that were technically missed by Sanger screening (F11, F13, F14, F16), a large deletion spanning an exonic region (F17), and mutations in potential modifier genes that result in different phenotypes in two siblings with the same causative mutation (F7, F12, Table 1).

Table 1 Reported mutations in known PID genes in our cohort

The strategies described above for prioritizing detected gene variants are commonly used. Although filtering out variants with a frequency of greater than 1 % is reasonable for many rare diseases, it may not be suitable for every PID, especially for recessive diseases with a relatively high incidence [5]. It should also be noted that variant frequencies may differ between ethnic groups. Some databases, for instance dbSNP, should be used with caution, since they are built from a myriad of sources that include not only polymorphisms but also disease-causing mutations. In-house databases can provide a suitable filter to remove platform-specific or laboratory-specific artifacts. However, the sample size of the database and the population structure of its samples have an impact on the sensitivity and specificity of the filter. For instance, if the in-house database only contains samples from a few big families, it could be enriched for rare, disease-associated heterozygous variants, which may lead to the exclusion of recessive disease-causing homozygous variants.

Disease-causing genes may not be found in all studied families. In our study, for instance, disease-causing candidate genes were identified in only 134 of the 183 families sequenced (134 of a total of 206 patients). Failure to identify causative genes in the remaining families could be due to many reasons. One is that some genes or exons, including some known PID genes (LAMTOR2, EPG5, IGKC, IGHG1, IGHG2, IGHG3, IGHG4, IGHA1, IGHA2, IGHE and IL36RN), are not well covered by or included in the capture arrays. Large insertions or deletions, copy number variations and structural variations are also not easily traceable using short read approaches. In addition, some under-investigated disease-associated variants, such as SNPs in UTR regions, long non-coding RNAs and non-canonical or deep intronic splicing elements, may not be detected by WES or targeted region sequencing. Advances in sequencing technologies and improvements in capture may narrow this gap, and carefully choosing a platform according to its characteristics and biases may also help to improve the success rate of genetic studies [45]. Beyond these technical limitations, new genetic etiologies may be unrelated to previously known genes or pathways and thus fall outside our working hypotheses; such etiologies are difficult to identify, especially when cases are sporadic or when only a very small number of patients are available in a family.

If no good candidate is found, it is essential to revisit all assumptions in the screening pipeline and the analysis parameters in order to find the missing pathogenic variants. To reduce false positive variant calls in homologous regions of the genome, we use only uniquely mapped reads to call variants; as a result, variants in these regions may be missed. In such cases, we can review these regions by repeating variant calling with all reads, including those that map to multiple locations.

Although the variant prioritization methods described above are able to exclude most of the irrelevant mutations, accurately identifying disease-causing mutations in the remaining patients is still a challenge. Novel methodologies to assist the variant prioritization process could be very helpful. One possible way is to prioritize genes based on their position in a pathway, the hypothesis being that genes in the same or a related pathway are more likely to be relevant to the same disorder. Mapping known MD genes into pathways and assessing their relative positions is essential to test this hypothesis. If it holds true, we will be able to relax the filtering criteria and thereby reduce false negative results in disease gene identification.

During the past decades, much effort has been devoted to unravelling the genetic basis of different forms of PID, and the etiology of many monogenic diseases has now been identified [6]. However, even in some disorders with a presumed monogenic origin, and in most patients with a disease of digenic or polygenic origin, little progress has been made to date. The genetic analysis is also hampered by a variable clinical picture, where patients with mutations in recognized PID genes that are known to result in a given clinical phenotype may, in fact, have features characteristic of a broad spectrum of different PIDs [18–24]. In addition, family members carrying the same “pathogenic” mutation as the proband may be healthy, suggesting an influence of modifying genes. Besides, non-genetic factors such as environment, age and diet can also contribute to the development of disease.

Mutations in several genes that are potentially disease-causing have been observed in patients with common variable immunodeficiency (reviewed in [6]). However, these only account for a minority of cases (10–15 %), suggesting that additional genes contribute to the development of the defect. The same issue is noted in patients with IgA deficiency, the most common form of PID in Caucasians, where several genes [25] have been found to be associated with the deficiency, yet, the associated variants are only found in a minority of patients. Thus, a search for additional disease-causing genes is clearly warranted in patients with different forms of PID. However, although the methodology (from single gene sequencing, to exome sequencing and ultimately whole genome sequencing) has improved the success rate in finding disease-causing mutations, we strongly feel that a multi-omics approach (including epigenetics, transcriptomics, proteomics, functional pathway testing and analysis of the microbiome) is now needed to supplement the genetic data and thus, to further the PID field.