Introduction

Identification of pathogenic DNA variants causing diseases is one of the main aims of medical genetic investigations. In the past, when direct DNA sequencing possibilities were limited, this goal was achieved only in cases for which the region of the genome harboring a given mutation could be reduced to a manageable size by other procedures, such as family-based linkage or haplotype analyses. In the absence of large pedigrees or of other favorable factors that could help this localization process, disease-causing variants could remain undetected for years. The recent commercialization of next-generation sequencing (NGS) platforms has introduced a substantial methodological shift in mutation detection procedures. Specifically, it has allowed the querying of megabases of DNA at once, through computer-based alignment of millions of short sequence reads [1]. Parallel sequencing of panels of candidate disease genes and whole exome sequencing (WES), investigating all of our protein-coding DNA (~2 % of the human genome), have now become routine procedures in most laboratories.

As the NGS technique develops, the price per sequenced base decreases, to the point that sequencing entire individual genomes is not a prohibitive effort any more. Compared to WES, the use of whole genome sequencing (WGS) in human genetics, and especially in medical genetics, is still in its infancy. The reasons for this delay are mainly two: WGS involves higher costs compared to WES and requires more complex analyses at the computational level. Unlike WES, however, WGS allows the identification of complex DNA variants that are not limited to the coding sequences of the genome and the detection of non-conventional events involving large stretches of DNA (Table 1). Moreover, WGS displays an increased sensitivity with respect to WES in relationship to coding sequences as well, as it analyzes contiguous DNA and allows better sequencing and mapping approaches. More specifically, since it is not limited by constraints originating from discontinuous DNA templates (captured exons), WGS can take advantage of information deriving from a “regional” context. For instance, WGS can identify gene fusions, duplications of exons, and other genetic defects that would likely be missed in the absence of information from surrounding, non-coding DNA, which is seldom targeted by pre-WES purification procedures. Coverage (number of times a given nucleotide is sequenced) in WGS is also in general more uniform, since genomic DNA is provided to the sequencer “as is”, without undergoing selection procedures that may artificially create an uneven representation of the template material to be sequenced.
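The notion of coverage uniformity can be made concrete with a small numerical sketch. The Python example below is only an illustration under simplifying assumptions: reads are reduced to hypothetical start and end coordinates on a toy 20-bp region, and uniformity is summarized by the coefficient of variation of the per-base depth (a choice made here for illustration, not a metric prescribed by the studies cited).

```python
from statistics import mean, pstdev

def per_base_depth(region_length, reads):
    """Count how many reads cover each base of a region.

    `reads` is a list of (start, end) intervals, 0-based and half-open.
    """
    depth = [0] * region_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth

def coverage_summary(depth):
    """Return (mean depth, coefficient of variation) for a depth profile."""
    average = mean(depth)
    cv = pstdev(depth) / average if average else float("inf")
    return average, cv

# Toy 20-bp region sequenced with four 10-bp reads in two different ways.
even_reads = [(0, 10), (10, 20), (0, 10), (10, 20)]   # reads tile the region
uneven_reads = [(0, 10), (0, 10), (0, 10), (0, 10)]   # reads pile up on one half

even_mean, even_cv = coverage_summary(per_base_depth(20, even_reads))
uneven_mean, uneven_cv = coverage_summary(per_base_depth(20, uneven_reads))
```

Both toy experiments reach the same mean depth, but only the first covers every base; in the second, half of the region is not sequenced at all, which is the kind of unevenness that a selection step can introduce.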

Table 1 Features of whole genome sequencing (WGS) vs. whole exome sequencing (WES)

Unfortunately, the wealth of information produced by WGS, despite being preferable from a theoretical standpoint, may also represent a burden for the identification of DNA variants meaningful to medical genetics. Such variants typically consist of one or a few mutations that have to be distinguished from thousands of benign DNA changes, and their identification has often been compared to the detection of a needle in a haystack. To follow the same analogy, WGS provides better chances of identifying pathological targets than WES, but at the same time it increases the size of the haystack, to the point that innocuous DNA changes may no longer be recognized as such. The advantages of WGS procedures can therefore be fully achieved only when analytical approaches can efficiently differentiate abnormal DNA changes from the multitude of benign variants that determine normal human heterogeneity.

To better illustrate all of these concepts, this review will focus on the use of WGS as a tool to detect rare DNA variants with a high phenotypic effect, such as germline mutations in Mendelian hereditary disorders and somatic mutations in cancer.

The medical genome: generalities and common procedures

The human reference genome

Because of the complexity of the human genome, NGS reads from WGS projects cannot be efficiently assembled via de novo procedures, but have to be mapped to a standard template sequence, the human “reference sequence”. This human reference genome consists of pooled sequence data from 13 healthy individuals with European ancestry [2], and has gradually evolved with the improvement of sequencing methods. It provides a common and unambiguous system of relationships between genomic coordinates and corresponding DNA bases.

Mapping of sequence reads and identification of variants

Following the generation of the raw DNA sequence reads by an NGS platform, the process of obtaining the full genome sequence of an individual (or, better, a reliable approximation of it) consists of a two-step, computer-based procedure. First, the short NGS reads are mapped to the reference genome by assigning to them specific genomic coordinates. This procedure is in general computationally intensive and is achieved by the use of various algorithms (e.g., BWA [3], AGILE [4], NovoAlign [novocraft.com], or FastHASH [5]). Then, mismatches between the reference genome and the individual genome are assessed by a bioinformatic process referred to as “variant calling” (e.g., via software such as GATK [6] or VCMM [7]).
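The two-step logic of mapping followed by variant calling can be sketched with a toy Python example. This is a deliberately naive illustration and not the algorithm used by any of the tools cited above: reads are placed by exact matching of their first k bases against a k-mer index of the reference, only substitutions are considered, and all names and thresholds (`k`, `max_mismatches`, `min_depth`, `min_fraction`) are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

def build_index(reference, k):
    """Index every k-mer of the reference by its 0-based start position."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Return the best 0-based mapping position of a read, or None.

    The first k bases of the read are used as an exact-match seed.
    """
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]

def call_variants(reference, reads, k=4, min_depth=2, min_fraction=0.8):
    """Pile up mapped reads and report substitutions as (1-based pos, ref, alt)."""
    index = build_index(reference, k)
    pileup = defaultdict(Counter)
    for read in reads:
        pos = map_read(read, reference, index, k)
        if pos is None:
            continue  # unmapped reads contribute nothing to the pileup
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    variants = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and depth >= min_depth and support / depth >= min_fraction:
            variants.append((pos + 1, reference[pos], base))
    return variants

reference = "ACGTTGCAAGGCTTACGATC"
sample = "ACGTTGCAATGCTTACGATC"      # G>T substitution at position 10
reads = [sample[2:12], sample[5:15], sample[8:18]]
calls = call_variants(reference, reads)   # [(10, 'G', 'T')]
```

In this toy run the third read stays unmapped because the substitution falls inside its seed. This shows, on a small scale, how mapping parameters can remove evidence before variant calling even starts.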

Both mapping and variant calling procedures can be highly parameterized and are susceptible to producing different outputs as a function of such parameters. Therefore, although for a given individual there is only one physical genome, made of DNA, at the present time we can only obtain one or more imperfect representations of it, made of bits and bytes. As a general rule, each step of any genome analysis produces both false positives, i.e., variants that are called but are not physically present in the genome, and false negatives, i.e., variants that are not called but are present in the physical genome. It is therefore important to minimize errors at these initial mapping and variant calling steps, since all downstream analyses will be made on the assumption that these data are a faithful representation of the physical genome.
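As a minimal sketch of how false positives and false negatives can be counted, the Python example below compares two hypothetical call sets with a trusted set of variants for the same sample. The variant tuples, the "strict" and "lenient" parameter settings, and the existence of a truth set are all assumptions made for illustration; in practice the physical genome is not known with certainty.

```python
def compare_calls(called, truth):
    """Compare a call set with a trusted set of variants for the same sample."""
    called, truth = set(called), set(truth)
    true_positives = called & truth
    return {
        "false_positives": sorted(called - truth),   # called, not really present
        "false_negatives": sorted(truth - called),   # present, but not called
        "precision": len(true_positives) / len(called) if called else 0.0,
        "sensitivity": len(true_positives) / len(truth) if truth else 0.0,
    }

# Variants are written as (chromosome, position, reference allele, alternative allele).
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}

# Two hypothetical outputs of the same pipeline run with different parameters.
strict_calls = {("chr1", 100, "A", "G")}
lenient_calls = truth | {("chr2", 77, "T", "C")}

strict = compare_calls(strict_calls, truth)
lenient = compare_calls(lenient_calls, truth)
```

The strict setting produces no false positives but misses two variants, whereas the lenient one recovers all of them at the price of one spurious call.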

General filtering procedures

Since every WGS project produces on average ~4,000,000 called variants [8, 9], identification of mutations relies on a series of filtering procedures that aim to recognize rare DNA changes with a pathogenic effect and to discard the multitude of variants that are unrelated to the disease studied. Comparison with databases reporting information from the unaffected population, such as dbSNP [10], the ESP database (evs.gs.washington.edu), the Exome Aggregation Consortium (ExAC) (exac.broadinstitute.org), etc., represents the most consistent filtering step, under the assumptions that such public databases (a) report reliable information and (b) include polymorphic variants having no direct relationship with genetic diseases. However, these databases have limitations, such as the presence of very rare and pathogenic mutations [11] and artifacts [12].

The frequency of the detected variants in the general population, rather than their mere presence in a database, should be taken into consideration during filtering procedures, since alleles from some (mostly recessive) diseases may very well be present in the general, unaffected population [13, 14]. Furthermore, most of these entries contain information about genotype and allele frequency in different human populations, which also allows other important analyses. In addition to comparisons with data providing information on biological variability, filtering of technical errors should also be put in place. NGS platforms as well as mapping and variant calling pipelines tend to produce technical noise (false positives) that is luckily rather constant and sequence specific. Comparison with a small set of control samples sequenced by the same NGS platform and processed by the same informatics pipeline would help to remove such errors from the genomes.
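The two filters described above, one based on population frequency and one based on in-house controls, can be sketched as follows in Python. The data structures, the 1 % frequency cutoff, and the rule that a variant seen in any control is discarded are illustrative assumptions, not recommended values.

```python
def filter_variants(variants, population_af, control_sets,
                    max_af=0.01, max_control_carriers=0):
    """Keep variants that are rare in the population and absent from controls.

    variants       -- list of variant keys found in the patient
    population_af  -- dict: variant key -> allele frequency in a public database
    control_sets   -- list of sets, one per control sample processed by the
                      same platform and pipeline
    """
    kept = []
    for variant in variants:
        frequency = population_af.get(variant, 0.0)   # unknown variants count as rare
        carriers = sum(variant in control for control in control_sets)
        if frequency <= max_af and carriers <= max_control_carriers:
            kept.append(variant)
    return kept

common = ("chr1", 1000, "A", "G")
rare_recessive = ("chr2", 2000, "C", "T")
pipeline_artifact = ("chr3", 3000, "G", "GA")
novel = ("chr4", 4000, "T", "C")

population_af = {common: 0.30, rare_recessive: 0.002}
controls = [{common, pipeline_artifact}, {pipeline_artifact}]

candidates = filter_variants(
    [common, rare_recessive, pipeline_artifact, novel], population_af, controls
)
# candidates == [rare_recessive, novel]
```

A frequency cutoff, as opposed to simple presence in a database, lets the rare recessive allele through even though it is listed, while the recurrent pipeline artifact is removed although no public database reports it.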

Since a considerable number of variants still survive general filtering, it may be useful to incorporate in the analysis a predictive tool that scores the impact of coding DNA changes on the corresponding protein sequence and, possibly, function. There are currently many software packages that can perform these tasks and compute whether a given variant potentially affects protein formation, expression, and/or interaction with other proteins. Among those that are used most often, we can cite SIFT [15], PROVEAN [16], PolyPhen-2 [17], and GERP++ [18]. Since prediction tools are not always concordant and their output is based on different parameters, most studies use a combination of two or more tools to infer the putative pathogenicity of the variants [19–21]. However, it is important to stress that all these packages provide information of a predictive nature, and that filtering procedures based on them will have in the end only a relative value.
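One simple way of combining several predictors is a majority vote, sketched below in Python. The scores are assumed to have been computed beforehand by the respective tools, since no tool is called here. The cutoffs and the requirement of two concordant votes are illustrative assumptions that should be checked against each tool's documentation.

```python
import operator

# Illustrative cutoffs (assumptions): a score passing the comparison counts
# as a "damaging" vote for that tool.
CUTOFFS = {
    "sift": (operator.le, 0.05),       # low scores taken as deleterious
    "polyphen2": (operator.ge, 0.85),  # high scores taken as damaging
    "provean": (operator.le, -2.5),    # strongly negative scores taken as deleterious
}

def damaging_votes(scores):
    """Count how many tools flag the variant; missing scores are skipped."""
    votes = 0
    for tool, (compare, cutoff) in CUTOFFS.items():
        value = scores.get(tool)
        if value is not None and compare(value, cutoff):
            votes += 1
    return votes

def is_candidate(scores, min_votes=2):
    """Retain a variant only if at least `min_votes` tools agree."""
    return damaging_votes(scores) >= min_votes

concordant = {"sift": 0.01, "polyphen2": 0.97, "provean": -4.1}
discordant = {"sift": 0.02, "polyphen2": 0.10, "provean": -1.0}
partial = {"sift": 0.03, "provean": -3.0}   # one tool returned no score
```

Here `is_candidate(concordant)` and `is_candidate(partial)` are true, while `is_candidate(discordant)` is false. Like any filter built on predictions, the outcome should be read as a ranking aid and not as proof of pathogenicity.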

Databases of disease-associated variants

Many public databases report the direct relationship between DNA changes and specific traits. Some of them contain information on variants that underlie or are associated with diseases, such as the Human Gene Mutation Database [13] or the Online Mendelian Inheritance in Man database (OMIM) [22]. For structural variations, the Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER) [23] lists copy number variations present in the control and affected populations. For cancer studies, the Catalogue of Somatic Mutations in Cancer (COSMIC) [24] stores bona fide somatic mutations related to human cancers. Some other databases collect the results from pharmacogenetic studies to contribute to the development of individualized treatments (PharmGKB [25], Pharmaco-miR [26]). All these databases have increased substantially in size in recent years, due to NGS and to increasingly large genetic studies. If integrated in WGS endeavors, they can be of great help in highlighting genetic variants associated with pathological traits.

Germline mutations

WGS in hereditary diseases

Pathogenic mutations with a high phenotypic effect can either be inherited from a person’s parents (germline mutations) or be acquired throughout life (somatic mutations). Pathologies resulting from germline mutations, which can be transmitted to the following generations, are commonly referred to as hereditary diseases, while somatic DNA injuries are usually not transmittable to the offspring and lead in general to tumors. Both germline and somatic mutations can be efficiently identified by WGS; however, technical and analytical approaches to detect these pathogenic variants are rather different (Fig. 1). A review of the recent literature shows that hereditary complex disorders, for which a combination of common variants in different genes and environmental factors contribute to the pathology, are still mostly investigated via non-NGS techniques. Conversely, WGS is beginning to be systematically used as a tool to understand the causes of Mendelian inherited diseases, resulting from germline mutations in one gene (e.g., [20, 27, 28]).

Fig. 1

Schematic workflow for the detection of potentially pathogenic DNA variants in hereditary diseases and in cancer. In hereditary diseases, the information from several genomes from a control cohort (white individuals) is assembled to produce a “metagenome” that includes all possible variants (both small events and copy number variations, or CNVs) that are allegedly not causing disease in the general population (blue bars and boxes). Potentially pathogenic variants are then deduced by comparing the WGS information from a patient (black individual) with that of the metagenome. In cancer, there is no need to query a control cohort, since the control information is provided by the genome of normal cells from the same patient. Regardless of their frequency in the general population, all these variants are then subtracted from the pool of DNA changes obtained from the tumor genome, making the detection process of pathogenic variants a more efficient and straightforward procedure

The initial approach for the detection of Mendelian mutations by WGS is virtually the same as that used for WES-based studies. It consists of focusing first on the coding region of genes, more specifically on variants leading to a change in the amino acid sequence of future proteins. However, the real power of WGS emerges when events involving non-coding regions are investigated. Compared to other techniques, WGS allows us to specifically extract information from parts of the genome that are usually neglected, and at a base-pair resolution. Recent WGS studies have indeed shown that a number of unsolved cases of Mendelian disease can be explained by mutations in non-coding regions and, to various degrees, involving coding parts of disease genes (e.g., [8, 29]). Similar examples include the direct identification of gene disruption by the insertion of mobile elements, which are already known to play a significant role in the molecular etiology of hereditary diseases [30], but which are difficult to identify by NGS techniques other than WGS (own unpublished results).

It is important to note that, regardless of the type of mutation, in all Mendelian disorders and within single pedigrees, pathogenic variants always co-segregate with the disease in affected individuals. Therefore, all patients within a family should necessarily share the same mutation(s) but not necessarily the same innocuous DNA variants. This elementary concept of human genetics is one of the most powerful elements of investigation in NGS studies, including WGS, since it allows us to discard benign variants that cannot be immediately recognized as such. One of the first WGS projects that exploited this paradigm is the one performed by Roach et al., who, following the comparison of individual WGS output from two healthy parents and two affected children, could reduce the number of candidate genes, genomewide, from thousands to only four [31].
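In its simplest form, co-segregation filtering is an intersection of the variant lists of all affected relatives. The Python sketch below illustrates this with made-up variant identifiers. The optional exclusion of variants seen in unaffected relatives is an extra assumption that only makes sense for a fully penetrant dominant model, since healthy parents are expected to carry recessive alleles.

```python
def cosegregating_variants(affected, unaffected=(), exclude_unaffected=False):
    """Return the variants shared by every affected member of a pedigree.

    affected / unaffected -- iterables of variant collections, one per relative
    exclude_unaffected    -- if True, also drop variants seen in any unaffected
                             relative (dominant, fully penetrant model only)
    """
    affected = [set(variants) for variants in affected]
    if not affected:
        return set()
    shared = set.intersection(*affected)
    if exclude_unaffected:
        for relative in unaffected:
            shared -= set(relative)
    return shared

# Hypothetical quartet: two affected children and two healthy parents.
child_1 = {"v1", "v2", "v3", "v5"}
child_2 = {"v1", "v3", "v4"}
mother = {"v3", "v4"}
father = {"v2", "v5"}

shared = cosegregating_variants([child_1, child_2])              # {'v1', 'v3'}
dominant_only = cosegregating_variants(
    [child_1, child_2], [mother, father], exclude_unaffected=True
)                                                                # {'v1'}
```

Each additional affected relative can only shrink the shared set, which is why even small pedigrees reduce the candidate list so effectively.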

For monogenic disorders with no genetic heterogeneity, a similar strategy could be extended from a single pedigree to a group of unrelated patients. In these cases, merging genomic data from different patients and different pedigrees represents a much more powerful approach, because unrelated affected individuals would all tend to have rare variants (mutations) only in the disease gene [27]. Conversely, in Mendelian disorders displaying genetic heterogeneity, this approach may lead to false positive results, highlighting as pathogenic benign variants that may be coincidentally shared by a group of patients, and therefore it should not be used.
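For unrelated patients, the comparison moves from the level of the variant to the level of the gene, because different families are expected to carry different mutations in the same disease gene. The following Python sketch assumes that each patient's variants have already been filtered for rarity and annotated with a gene name. Gene and variant labels are hypothetical, and `min_fraction` is an illustrative parameter.

```python
from collections import Counter

def genes_shared_by_patients(rare_variants_by_patient, min_fraction=1.0):
    """List genes carrying at least one rare variant in enough patients.

    rare_variants_by_patient -- dict: patient id -> list of (gene, variant id)
    min_fraction             -- fraction of patients that must carry a rare
                                variant in the gene (1.0 = all patients)
    """
    n_patients = len(rare_variants_by_patient)
    counts = Counter()
    for variants in rare_variants_by_patient.values():
        # Count each gene once per patient, whatever the number of variants.
        counts.update({gene for gene, _ in variants})
    return sorted(
        gene for gene, n in counts.items()
        if n_patients and n / n_patients >= min_fraction
    )

cohort = {
    "patient_1": [("GENE_A", "a1"), ("GENE_B", "b1")],
    "patient_2": [("GENE_A", "a2"), ("GENE_C", "c1")],
    "patient_3": [("GENE_A", "a3"), ("GENE_A", "a4"), ("GENE_B", "b2")],
}

all_patients = genes_shared_by_patients(cohort)                    # ['GENE_A']
most_patients = genes_shared_by_patients(cohort, min_fraction=0.6)
```

Lowering `min_fraction` tolerates some heterogeneity but, as noted above, it also increases the risk of retaining genes that are shared by coincidence.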

Identification of recessive, dominant, or X-linked mutations

Since heterozygous recessive mutations do not cause disease, they can be present, even at non-negligible frequencies, in the general population [13, 14]. Patients would conversely be either homozygotes for a mutation or compound heterozygotes for two different mutations in the same gene (Fig. 2). This simple genetic concept has tremendous consequences in WGS-based searching for mutations, as only about a dozen genes will harbor rare, non-synonymous variants in the homozygous or compound heterozygous state genomewide. Other features of WGS data can also be exploited to identify pathogenic variants. For instance, the elevated granularity of WGS data allows precise haplotype phasing in trios or quartets, to the point that meiotic recombination events in the parents of an index case can be detected. In other words, it is possible to identify all the regions of the genome identical by descent in affected individuals of a kindred, which by definition should harbor both the mutation transmitted from the father and that transmitted from the mother [31, 32]. An extension of the same concept is autozygosity mapping, which reaches its highest possible precision when WGS information is used. This technique scores stretches of homozygous alleles (usually SNPs) in consanguineous families segregating a recessive disease to detect the single homozygous recessive mutation originating from a heterozygous mutation in a common ancestor. Since alleles are transmitted from one generation to the other in large genomic “blocks” by meiotic recombination, the genomic region surrounding this homozygous mutation would also be completely homozygous for benign variants, which would act as a beacon for the presence of pathogenic recessive mutations [33].
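The recessive model described above translates into a simple per-gene rule: retain genes with at least one rare homozygous variant, or with two or more different rare heterozygous variants. The Python sketch below applies this rule to hypothetical, pre-filtered variants. Two heterozygous variants in the same gene are labeled only as possible compound heterozygotes, because confirming that they lie on different parental chromosomes requires phasing information not modeled here.

```python
from collections import defaultdict

def recessive_candidate_genes(variants):
    """Select genes compatible with autosomal recessive inheritance.

    variants -- iterable of (gene, variant id, genotype), where genotype is
                either "hom" or "het"; input is assumed to contain only rare,
                potentially damaging variants
    """
    by_gene = defaultdict(lambda: {"hom": set(), "het": set()})
    for gene, variant_id, genotype in variants:
        by_gene[gene][genotype].add(variant_id)

    candidates = {}
    for gene, calls in by_gene.items():
        if calls["hom"]:
            candidates[gene] = "homozygous"
        elif len(calls["het"]) >= 2:
            candidates[gene] = "possible compound heterozygous"
    return candidates

patient = [
    ("GENE_A", "a1", "hom"),
    ("GENE_B", "b1", "het"),
    ("GENE_B", "b2", "het"),
    ("GENE_C", "c1", "het"),   # a single heterozygous variant is not sufficient
]

candidates = recessive_candidate_genes(patient)
# {'GENE_A': 'homozygous', 'GENE_B': 'possible compound heterozygous'}
```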

Fig. 2

Possible configurations of pathogenic mutations for autosomal recessive and autosomal dominant conditions. Structural events are usually better or exclusively detected via WGS procedures and therefore genotypes b, d, e, and g may be easily missed by other sequencing techniques

In contrast, in autosomal dominant conditions, only one variant in a specific disease gene gives rise to the pathological phenotype. Compared to recessive cases, it is more difficult to infer pathogenicity of a given DNA variant since, in absence of other information, in principle all of the rare DNA changes detected genomewide can be the mutation causing disease (Fig. 2). Filtering steps as well as the careful use of clinical and public databases and pedigree-based co-segregation analyses become therefore essential. In case the condition is known to display no genetic heterogeneity, then the most powerful tool to infer pathogenicity becomes data merging across different unrelated patients, for the reasons described above.

Recent literature has shown that a substantial proportion of seemingly dominant cases may also result from the presence of de novo mutations [34–37]. In such cases, trio analyses would be the best strategy to choose, since the appearance of de novo mutations would be easily scored by subtracting the lists of genomic variants of the parents from that of the patient, without in principle the need to filter data from common variation databases.
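In a trio, the subtraction amounts to a set difference, as in the Python sketch below. The optional `callable_sites` argument is an assumption added here, not a step described in the studies cited: it restricts candidates to positions that were adequately covered in both parents, since a variant missed in a parent for technical reasons would otherwise appear to be de novo.

```python
def de_novo_candidates(child, mother, father, callable_sites=None):
    """Return variants present in the child and absent from both parents.

    Variants are (chromosome, position, reference allele, alternative allele).
    callable_sites -- optional set of (chromosome, position) pairs that were
                      adequately covered in both parents
    """
    candidates = set(child) - set(mother) - set(father)
    if callable_sites is not None:
        candidates = {v for v in candidates if (v[0], v[1]) in callable_sites}
    return candidates

inherited_m = ("chr1", 100, "A", "G")
inherited_f = ("chr2", 200, "C", "T")
new_1 = ("chr3", 300, "G", "A")
new_2 = ("chr4", 400, "T", "C")   # parental coverage too low at this site

child = {inherited_m, inherited_f, new_1, new_2}
mother = {inherited_m}
father = {inherited_f}

unfiltered = de_novo_candidates(child, mother, father)           # {new_1, new_2}
confident = de_novo_candidates(
    child, mother, father, callable_sites={("chr3", 300)}
)                                                                # {new_1}
```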

Finally, procedures for X-linked cases would be substantially the same as those for dominant ones, with the exception that the genomic region to be considered would be limited only to the X chromosome.

Somatic mutations

WGS in cancer

As mentioned, DNA errors can also be acquired somatically throughout life, because of age, environment, diet, etc. These mutations are usually not transmitted to the offspring but can accumulate and lead to disease. This is the case for most cancers, where somatic defects lead to dysregulated cell growth and eventually to tumor and metastasis.

Detection of pathogenic somatic variants via WGS procedures is a much simpler effort, compared to that involving germline mutations in hereditary diseases. Indeed, the cancer genome of a given patient can be directly compared with that from tumor-free tissues from the same individual (usually blood leukocytes). This process eliminates the need for constructing an imprecise reference “metagenome” resulting from cohorts of unrelated individuals. In this context, the fact that a given individual’s germline genome carries polymorphic variants, rare DNA changes, or even large structural variations with respect to control genomes is completely irrelevant, since the mutations that count are those present in the cancer genome only (Fig. 1). In other words, the germline genome represents a baseline dataset used as a subtracting factor to obtain an unbiased count of all the acquired somatic mutations. Ley et al. were among the first to apply this method to an acute myeloid leukemia, identifying in the end two known mutations for cancer progression and eight novel mutations that could be used for possible targeted therapy [38].

Cancer appearance, progression, relapse

Since cancer is an evolving disorder, WGS can be used to score tumor progression, relapse, and remission by analyzing its genomic content at different time points. Concerning tumor progression, a study by Ding et al. [39] investigated basal-like breast cancer via four parallel WGS procedures: on the peripheral blood of the patient to obtain a baseline genome, on the primary tumor to detect the somatic mutations, on a brain metastasis to understand metastatic transformation, and on the genome of a human-to-mouse primary tumor xenograft to understand the mechanisms of tumor changes following transplantation.

Tumor evolution in the context of therapeutic treatments can also be studied by WGS, as shown by a report on clonal evolution in acute myeloid leukemia cells, a cancer characterized by frequent relapses following chemotherapy treatment [40]. In this work, the authors noted two distinct patterns of tumor genome evolution: in the first one, the primary tumor clone gained mutations that made it evolve into the relapse clone and therefore survive treatment; in the second one, chemotherapy applied a selective pressure enabling a specific sub-clone of the initial tumor to expand, and again survive treatment.

Tumorigenic pathways

Although every cancer has a unique landscape of somatic events, in some instances mutations tend to affect common genes, highlighting dysregulation of shared, important pathways for tumor progression. Analyses aimed at identifying such pathways can be done by considering multiple cases of the same tumor, to increase the signal represented by driver mutations (DNA changes providing selective advantage to a cancer cell clone) and minimize the noise deriving from passenger mutations (DNA changes that do not contribute to cancer etiology but accumulate in rapidly expanding clones). In a way, such analyses are very similar to those outlined above for hereditary diseases, for which multiplication of the patients’ or controls’ genomes to be analyzed helps to eliminate DNA changes which are not relevant to the disease. This approach has been applied to a relatively large number of different tumors, for a total of ~150 genomes analyzed [41–49]. In addition to providing new insights into mutation-based differential prognosis, tumor molecular classification, progression mechanisms, etc., comparative WGS on multiple tumor samples helps to identify tumor signatures and mutational spectra across different types of cancer [50] or within the same tumor type, such as smoker and non-smoker lung cancer genomes [43].
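The combination of tumor-versus-normal subtraction within each patient and recurrence counting across patients can be sketched in Python as follows. Gene names, variant labels, and the `min_cases` threshold are hypothetical. Recurrence alone is only a crude indicator of driver status, because factors such as gene size and local mutation rate, which are not modeled here, also influence how often a gene is hit.

```python
from collections import Counter

def somatic_variants(tumor, normal):
    """Variants found in the tumor genome but not in the matched normal genome."""
    return set(tumor) - set(normal)

def recurrently_mutated_genes(cases, min_cases=2):
    """Count, per gene, the number of patients with a somatic variant in it.

    cases -- list of (tumor variants, normal variants) pairs, one per patient;
             each variant is a (gene, variant id) tuple
    """
    counts = Counter()
    for tumor, normal in cases:
        # Each gene is counted once per patient.
        counts.update({gene for gene, _ in somatic_variants(tumor, normal)})
    return {gene: n for gene, n in counts.items() if n >= min_cases}

cases = [
    ({("GENE_X", "x1"), ("GENE_P", "p1"), ("GENE_Q", "q1")}, {("GENE_P", "p1")}),
    ({("GENE_X", "x2"), ("GENE_R", "r1")}, set()),
    ({("GENE_X", "x3"), ("GENE_Q", "q2"), ("GENE_P", "p1")}, {("GENE_P", "p1")}),
]

recurrent = recurrently_mutated_genes(cases)
# {'GENE_X': 3, 'GENE_Q': 2}
```

In this toy cohort, the germline variant in GENE_P is removed by the subtraction step in both patients who carry it, and the gene mutated in a single tumor (GENE_R) falls below the recurrence threshold.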

Conclusions

From a genetic standpoint, there is nothing more exhaustive than the full sequence of a genome. It is therefore easy to predict that, when costs associated with WGS substantially decrease and better analytical tools are available, this procedure will become the technique of choice for most medical genetics investigations. WGS can in fact detect features of the human genome, such as copy number variations and intronic mutations, that other techniques cannot identify or struggle to identify, and that are becoming increasingly relevant to human genetic pathology. Furthermore, it is conceivable that many different genetic tests, which are currently performed as individual analyses (array CGH, sequence-specific mutation detection, gene panel screening, etc.), could soon be replaced by a single WGS run, which in fact can provide all of this information at once.

However, for WGS to become a popular tool in research and a routine test for DNA diagnosis, a few improvements still have to be made. From a clinical standpoint, diagnosis of the disease has to be very accurate, especially in terms of inheritance, because all downstream analyses would depend on it. Also, since a person’s whole genome is unveiled, the risk of incidental findings is very high, revealing the need for integrating ethical policies adapted to this specific test. On the technical side, sequencing errors and noise have to be better estimated and eliminated, since false positive findings or long processing times are not compatible with diagnostic needs. This could be done by optimizing the reference genome, databases of common variants, prediction software, and also pre-WGS experimental design (e.g., by including specific information of a patient’s family). In a more distant future, complex diseases will probably also be approached by WGS, to fully exploit the wealth of information that this technique produces in the context of variants that are not pathogenic per se, but that can cause disease via additive or multiplicative effects.