Introduction

Germline variants are nucleotide changes in germ cells (sperm or egg) and can be passed from parents to a child at conception. Because these variants reside in reproductive cells, they are hereditary and can be transmitted to future generations. Germline mutations account for ~5–10% of cancers [1]. Somatic variants arise in any cell other than germ cells and cannot be transmitted to progeny. Somatic variants include mosaicism in different subsets of somatic cells, including clonal hematopoiesis of indeterminate potential (CHIP). Somatic variants are of particular interest because they are associated with various human diseases, including cancers.

Traditional germline/somatic genetic testing relied on gene panels focused on hotspot variants in a number of well-characterized driver genes, such as BRCA1 and BRCA2 [2]. With the advances in and reduced cost of next-generation sequencing (NGS) technology, whole exome/genome sequencing (WES/WGS) and targeted sequencing have become options for detecting variants on a much larger scale and at higher resolution. A major challenge of WGS/WES analysis is the accuracy of calling single nucleotide variants (SNVs) and small insertions and deletions (indels).

Development of SNV/Indel Variant Calling in the Past Years

An NGS workflow usually starts with the fragmentation of the genome or of targeted regions into small fragments, followed by alignment of the resulting reads to a reference genome or by genome re-assembly. The aligned (piled-up) segments are subsequently used for variant detection. In early studies, variant calling was performed by counting alleles at each site and applying simple cutoff rules to determine a variant call, an approach that often lacks sensitivity for heterozygous alleles and does not provide a confidence level for the genotype calls [3].
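The count-and-cutoff idea can be illustrated with a minimal Python sketch. The function name, the input format (a string of read bases at one site), and both thresholds are assumptions made for illustration only; they are not taken from any published caller.

```python
from collections import Counter

def call_by_counting(pileup_bases, ref_base, min_depth=8, min_alt_fraction=0.2):
    """Return the called alternative base at one site, or None.

    pileup_bases: read bases covering the site, e.g. "AAAGAG".
    min_depth and min_alt_fraction are illustrative cutoffs (assumptions).
    """
    ref_base = ref_base.upper()
    bases = [b for b in pileup_bases.upper() if b in "ACGT"]
    depth = len(bases)
    if depth < min_depth:
        return None  # too shallow to make any call
    alt_counts = Counter(b for b in bases if b != ref_base)
    if not alt_counts:
        return None  # every read supports the reference
    alt, alt_count = alt_counts.most_common(1)[0]
    # A hard cutoff: no error model and no confidence score for the call
    return alt if alt_count / depth >= min_alt_fraction else None
```

A site with 4 of 10 reads supporting G is called, whereas a site with 1 of 10 is not, regardless of base quality. This is the lack of a confidence measure that the probabilistic callers described later address.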

Uncertainties of variant calls arise when a sample’s coverage is shallow, sequencing read quality is poor, or a variant site has low allele count support [4]. After variant calling, layered filters are therefore recommended to reduce the likelihood of sequencing artifacts in the call sets and to increase the confidence of variant calls. An in-depth overview of the filters that can be considered is given in the section “Contributing Factors for Bogus Somatic Variant Calling” of this chapter.

Germline and somatic variant calling algorithms differ in their assumptions about the expected allele frequency. Germline variants are expected to have allele frequencies of 50% or 100%, which distinguish the three basic genotypes at each variant site: homozygous for allele A (AA), heterozygous (AB), or homozygous for allele B (BB). In contrast, the allele frequencies of somatic variants span a much wider range, reflecting distinct stages of cell development. An increasing number of algorithms have been developed in the past decades to enhance calling accuracy by incorporating error rate estimation and probabilistic frameworks that model the genotyping and phasing likelihoods. Given the complexity of genomes, local re-assembly has also been incorporated into the calling scheme to increase the confidence of variant calling. Table 3.1 provides a summary of the tools available to date for somatic and/or germline variant calling. In the following section, we introduce the algorithms implemented in a few popular variant callers.

Table 3.1 List of publicly available tools for variant calling in chronological order

Algorithm Basis of Germline SNV/Indel Variant Calling

Samtools mpileup [5] uses read coverage depth counting to characterize potential SNV/indel sites. The coverage information is then fed into BCFtools [6] for variant calling based on a general Bayesian likelihood model. This approach is usually used for germline variant calling.
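To make the idea of a Bayesian genotype likelihood concrete, the following Python sketch computes posterior probabilities for the three diploid genotypes at a single site. It is a textbook-style simplification and not the BCFtools model: the function name, the independence of reads, the uniform split of errors over the three other bases, and the prior values are all assumptions of this sketch.

```python
import math

def genotype_posteriors(bases, quals, ref, alt, priors=(0.998, 0.001, 0.001)):
    """Posterior probabilities for genotypes (ref/ref, ref/alt, alt/alt).

    bases: observed read bases at the site; quals: matching Phred scores (> 0).
    priors: assumed prior genotype probabilities (illustrative values only).
    """
    log_posteriors = []
    for alt_copies, prior in zip((0, 1, 2), priors):
        alt_fraction = alt_copies / 2
        log_lik = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)  # Phred score -> error probability
            # Probability of the observed base if the read came from each allele
            p_ref = 1 - err if base == ref else err / 3
            p_alt = 1 - err if base == alt else err / 3
            log_lik += math.log((1 - alt_fraction) * p_ref + alt_fraction * p_alt)
        log_posteriors.append(log_lik + math.log(prior))
    # Normalise in log space for numerical stability
    top = max(log_posteriors)
    weights = [math.exp(lp - top) for lp in log_posteriors]
    total = sum(weights)
    return [w / total for w in weights]
```

With five reference and five alternative reads at Phred 30, the heterozygous genotype receives almost all of the posterior mass, and the posterior itself serves as the confidence level that simple counting rules lack.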

GATK HaplotypeCaller [7] is a widely used germline variant caller. An advantage of GATK is that the algorithm can be applied for the joint calling of a group of samples at the same time to control the false discovery rate and increase the sensitivity of low-frequency variant detection. In addition, GATK allows the re-assembly of reads to re-construct the real allelic segment or haplotype, which will be realigned to the reference genome to identify the variant sites. GATK HaplotypeCaller begins with defining active regions where abundant evidence has shown the presence of variants. Only the active region is used for variant calling to reduce the time on the assembly. With the assembly step, the variant calling is not only dependent on the read alignment against the reference genome but also the reconstructed haplotype. The overall GATK algorithm takes a divide-and-conquer concept by shredding the sequencing data into small chunks for parallel processing; however, its efficiency is still a concern when processing a large collection of samples for joint calling. Approaches have been proposed to address the performance issue when dealing with a large number of samples [8].

FreeBayes [9] applies a Bayesian framework that relates the likelihood of sequencing errors in the reads to the prior likelihood of a particular genotype. The phase of haplotypes is inferred from the reads, and non-uniform copy number across samples is taken into consideration. FreeBayes is usually used for germline variant calling, although it has been extended to somatic calling [10]. FreeBayes shows good performance across sequencing platforms for SNV calling, but it tends to have a higher false-positive rate at indel sites [11].

DeepVariant [12] performs variant detection using a convolutional neural network (CNN) model implemented with the Python TensorFlow library. DeepVariant identifies variants by learning the relationship between images of read pileups surrounding putative variants and the true genotypes. A version of DeepVariant for somatic calling is still under development.

Algorithm Basis of Somatic SNV/Indel Variant Calling

Mutect2 [13], part of the GATK toolkit, shares a similar variant calling process with GATK HaplotypeCaller and is mainly used for somatic calling with matched, paired tumor-normal samples. Mutect2 also allows tumor-only calling (see section “SNV/Indel Variant Calling”). Mutect2 calls SNVs and indels simultaneously via the local de novo assembly of haplotypes in an active region, as described previously. It reassembles the reads in the active regions into candidate variant haplotypes. Each read is then aligned to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, log odds are derived by a Bayesian somatic likelihood model to distinguish somatic variants from sequencing errors.
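The role of a log-odds score can be illustrated at a single site with a much simpler model than the one Mutect2 uses. The Python sketch below compares the likelihood of the reads under "a variant is present at the observed allele fraction" with the likelihood under "all alternative bases are sequencing errors". It works on pileup bases rather than assembled haplotypes and Pair-HMM likelihoods, so it should be read as a conceptual aid under stated assumptions (independent reads, Phred scores > 0, hypothetical function names), not as the Mutect2 algorithm.

```python
import math

def site_log10_likelihood(bases, quals, ref, alt, alt_fraction):
    """log10 likelihood of the reads if a fraction `alt_fraction` carries alt."""
    total = 0.0
    for base, q in zip(bases, quals):
        err = 10 ** (-q / 10)
        p_ref = 1 - err if base == ref else err / 3
        p_alt = 1 - err if base == alt else err / 3
        total += math.log10(alt_fraction * p_alt + (1 - alt_fraction) * p_ref)
    return total

def variant_log_odds(bases, quals, ref, alt):
    """log10 odds of 'variant at the observed fraction' versus 'errors only'."""
    observed_fraction = sum(b == alt for b in bases) / len(bases)
    return (site_log10_likelihood(bases, quals, ref, alt, observed_fraction)
            - site_log10_likelihood(bases, quals, ref, alt, 0.0))
```

Two high-quality alternative reads out of ten already give a clearly positive score, because two independent Phred 30 errors to the same base are very unlikely. A real caller additionally compares against the matched normal and applies artifact filters.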

SomaticSniper [14] is another somatic variant caller. SomaticSniper determines the somatic status of a variant site by comparing the site’s genotyping likelihood between normal and tumor derived from the MAQ tool [15] using a Bayesian approach. SomaticSniper implemented internal filters to exclude the sites with poor read/base quality or with low read support to reduce calling artifacts.

VarScan2 [16] relies on the results from SAMtools pileup or mpileup for somatic variant calling. At each variant site, VarScan2 compares the genotypes and supporting read counts between tumor and normal to determine the somatic status, and the call-set is refined with post-calling filters including the variant position in a read, strand bias, read coverage depth, variant frequency, homopolymer, mapping quality, and so on [16]. Of note, VarScan2 also allows the germline variant calling and detection of somatic copy number abnormality (SCNA).

MuSE [17] somatic calling starts with matched tumor-normal alignment BAM files. The alignment is first filtered for sequencing artifacts. The evolutionary F81 Markov substitution model of DNA is then applied to describe the changes from the reference to the tumor allele composition, with estimates of the equilibrium frequencies for all alleles and of the evolutionary distance. From these frequencies, MuSE derives a sample-specific error model and five-tier cutoffs to address the variation in the frequency distribution in tumor and normal samples. The tier-based approach allows MuSE to retain variants with low variant allele frequency and thereby achieve higher sensitivity.
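For readers unfamiliar with the F81 model, its transition probabilities have a closed form: the probability of observing base j after evolutionary distance t, starting from base i, is exp(-βt)·[i = j] + (1 - exp(-βt))·π_j, where π_j is the equilibrium frequency of base j and β = 1 / (1 - Σ π_k²) when t is measured in expected substitutions per site. The Python sketch below evaluates this formula only; how MuSE estimates the frequencies and distance and turns them into calls is not shown, and the function name is hypothetical.

```python
import math

def f81_transition_matrix(pi, distance):
    """F81 transition probabilities P[i][j].

    pi: dict mapping each base to its equilibrium frequency (must sum to 1).
    distance: evolutionary distance in expected substitutions per site.
    """
    beta = 1.0 / (1.0 - sum(p * p for p in pi.values()))
    decay = math.exp(-beta * distance)
    return {
        i: {j: decay * (1.0 if i == j else 0.0) + (1.0 - decay) * pi[j] for j in pi}
        for i in pi
    }
```

At distance zero the matrix is the identity (no change from the reference), and as the distance grows every row approaches the equilibrium frequencies.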

Strelka2 [18] is an open-source somatic/germline variant caller developed by Illumina®. The somatic calling algorithm of Strelka2 extends the original Strelka [19] method to account for tumor-in-normal contamination, an adjustment that is essential for liquid tumor variant analyses. Strelka first identifies indel regions and performs realignment. After realignment, Strelka derives a somatic variant probability using the tumor and normal samples and deduces the somatic status of a site after accounting for loss of heterozygosity (LOH) or copy number change regions. Strelka applies a two-tier filtering strategy with distinct filters and sensitivity. Similar to other tools, Strelka2 applies post-filtering to handle different types of potential calling errors.

Variant calling is usually computationally intensive, particularly when the number of samples is large. To improve efficiency, Illumina® has released the Dynamic Read Analysis for GENomics (DRAGEN) platform, which uses highly configurable field-programmable gate array (FPGA) hardware to accelerate the analysis processes [20]. DRAGEN first identifies callable regions and assembles the haplotypes using a De Bruijn graph method. The reassembled haplotypes are aligned to the reference genome to identify the variants. The probability of all read alignments to the haplotype is calculated via the pair hidden Markov model, which is accelerated on the FPGA, and summed up for each read. In the end, the diploid genotype is calculated to determine the variant calls.
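As a reminder of what a De Bruijn graph is, the Python sketch below builds one from a set of reads: every k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and candidate haplotypes correspond to paths through the graph. This is a generic textbook construction with a hypothetical function name; pruning of low-support edges, cycle handling, and path enumeration, which production assemblers require, are omitted.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph
```

A node with more than one successor marks a point where the reads diverge, which is where alternative haplotypes (and hence candidate variants) arise.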

In the past few years, GPU-based read alignment and variant calling solutions have also been developed to reduce the WGS data processing time to a couple of hours. For example, NVIDIA Clara Parabricks pipelines include a somatic variant calling workflow that integrates GPU-based alignment by BWA-MEM and downstream somatic variant calling by Mutect2 [13] or DeepVariant [12]. Parabricks also allows germline calling using GATK HaplotypeCaller [7]. The pipeline reduces the processing time for a typical 30× WGS dataset by over an order of magnitude.

SNV/Indel Variant Calling Workflows

A variant calling workflow can be divided into four steps: data preprocessing, variant calling, variant filtering, and variant annotation. Each step has its own challenges and strategies. We detail these steps as follows.

Data Preprocessing

The raw read quality can be examined using FastQC [21]. FastQC identifies the potential read issues before mapping. A good WGS/WES read library usually has an average read base quality >20 and a low level of duplicated or overrepresented sequences.
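The quality threshold above can be checked directly from a FASTQ file. The Python sketch below decodes quality strings and reports the fraction of reads whose mean base quality exceeds 20. It assumes Phred+33 encoding and uncompressed four-line records, and the function names are hypothetical; FastQC reports richer per-position and per-sequence distributions, which this does not attempt to reproduce.

```python
def mean_base_quality(quality_string, offset=33):
    """Mean Phred score of one read from its FASTQ quality line (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)

def fraction_passing(fastq_lines, min_mean_quality=20):
    """Fraction of reads whose mean base quality exceeds the threshold.

    fastq_lines: iterable of lines from an uncompressed FASTQ with four lines
    per record (header, sequence, '+', qualities).
    """
    quality_lines = [line.strip() for i, line in enumerate(fastq_lines) if i % 4 == 3]
    passing = sum(mean_base_quality(q) > min_mean_quality for q in quality_lines)
    return passing / len(quality_lines)
```

For a real file, the lines could be supplied with `open(path)`; the path and any downstream reporting are left to the reader.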

Selection of the reference genome is the first step for correct variant calling. The latest version of the human reference genome, GRCh38 (Hg38), with improved resolution [22], is suggested for human variant analyses. The reference should also include decoy genome sequences to reduce misalignments, as well as sequences of viruses known to occur in humans to attract the viral reads. In addition, the alternative contigs from highly complex loci, such as the human HLA region, should be included to reduce SNV/indel calling artifacts. For read alignment, frequently used aligners are BWA [5], Bowtie2 [23], and Novoalign (http://www.novocraft.com/products/novoalign/). Benchmarks of short-read aligners indicated that the MEM algorithm implemented in BWA achieved a better balance between specificity and sensitivity [24, 25]. BWA-MEM is recommended when the read length is greater than 70 bp, and BWA-ALN for shorter reads [26].

Following alignment, duplicate reads generated by PCR artifacts are flagged using tools such as GATK MarkDuplicates to prevent downstream variant calling errors. Incorrect read alignment around indel regions frequently causes inaccurate substitution calls. These alignment artifacts can be reduced through indel realignment by GATK IndelRealigner or similar tools. Furthermore, the base qualities produced by different library preparation protocols and sequencing instruments carry different levels of technical or chemistry errors. The GATK toolkit includes two tools, BaseRecalibrator and ApplyBQSR, to facilitate the correction of these systematic errors. These tools implement machine learning approaches to model the errors and adjust base qualities to obtain a more accurate overall base quality profile. Figure 3.1a shows a general workflow for data preprocessing.
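The core idea behind base quality recalibration, comparing the quality a sequencer reports with the mismatch rate actually observed, can be sketched in a few lines of Python. The example bins observations by reported quality only and uses +1/+2 pseudocounts; both choices, and the function name, are assumptions of this sketch. GATK additionally conditions on covariates such as read group, machine cycle, and sequence context, and masks known variant sites before counting mismatches.

```python
import math
from collections import defaultdict

def empirical_qualities(observations):
    """Empirical Phred quality for each reported quality bin.

    observations: iterable of (reported_quality, is_mismatch) pairs collected
    at positions that are NOT known variant sites.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += bool(is_mismatch)
    # Pseudocounts (an assumption of this sketch) avoid log10(0) for error-free bins
    return {
        q: -10 * math.log10((errors[q] + 1) / (totals[q] + 2))
        for q in totals
    }
```

If bases reported as Q40 mismatch the reference about 10% of the time at non-variant sites, their empirical quality is close to Q10, and a recalibration step would lower the reported values accordingly.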

Fig. 3.1

The workflow of the somatic variant calling of paired tumor-normal samples. (a) Data preprocessing steps from sample preparation to short reads mapping and calibration into binary version of Sequence Alignment/Map (BAM) files for paired tumor and normal samples. (b) Variant calling and annotation steps from paired tumor-normal BAM files to annotated somatic variants in VCF format

SNV/Indel Variant Calling

The next step is to choose appropriate variant callers. The GATK tool suite performs well for germline SNV/indel calling. A number of best practices for variant calling have been provided by GATK (https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows). For somatic variant calling, accurate identification of a somatic variant is still not trivial because of varied caller performance and tumor heterogeneity. Below we describe three common scenarios in somatic and germline variant calling, as well as variant prioritization in cancer genomics.

Somatic Mutation Calling on Matched Tumor-Normal Pairs

Variant calling with matched tumor-normal sample pairs is the most common scenario for the identification of somatic variants (Fig. 3.1b). Most callers use the aligned BAM files of paired tumor and normal samples as the standard inputs. To identify low-frequency variants, a caller that can model the allele frequency is suggested, such as Mutect2, MuSE, or Strelka2, as detailed in the Introduction. Because of differences in the underlying algorithms and statistical modeling, somatic variant callers differ in sensitivity and specificity when detecting variants at different variant allele frequencies (VAF) [27]. Compared with Strelka and Mutect, SomaticSniper has lower sensitivity and specificity when calling variants with VAF <8%. However, the performance of SomaticSniper is comparable with that of Strelka and Mutect for variants with VAF >18%. The sensitivity of VarScan2 increased with lower minimum allele fraction thresholds, but at the cost of reduced specificity [28]. Therefore, careful setting of thresholds to balance sensitivity and specificity for each caller, together with a well-considered post-calling filtering strategy, plays an important role in assuring the validity of the final call sets.

Given the complex heterogeneity and structural rearrangements of tumor tissue, finding an appropriate somatic variant caller, fine-tuning its parameters, and developing a solid calling strategy remain major challenges for cancer genomics. To tackle this complexity and exploit each caller’s strength, consensus voting across multiple callers to determine a valid variant call has gradually become a prevalent strategy [29,30,31,32,33]. In addition to simple voting, machine learning has been incorporated into the consensus calling step to improve calling performance. MutationSeq incorporated multiple sequence quality features derived from normal data based on Samtools and GATK, along with several sequence artifact and low-frequency variant features, to build classifiers that determine the somatic variants [33]. SomaticSeq [34] integrated five somatic callers, from which feature sets were identified for each candidate variant position to build a classifier using a stochastic boosting machine learning algorithm. Cerebro [35] applied a random forest classification model to generate a confidence score for each candidate variant derived from whole-exome sequencing data, which is limited to the coding region with >150× coverage. These approaches generally lack portability, i.e., users are required to obtain appropriate training data and to have sufficient knowledge of machine learning to re-train the models. In light of these issues, SMuRF [31] was developed and generalized for either WGS or WES data. SMuRF implemented supervised machine learning using features derived from four variant callers along with auxiliary mapping features. NeoMutate [29], another machine learning–based caller, profiled a collection of seven distinct classifiers based on a training dataset of >3000 cancer variants from the Catalogue of Somatic Mutations in Cancer (COSMIC) database [36].
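The simple voting strategy can be sketched as follows in Python. Each caller's output is reduced to a set of (chromosome, position, reference allele, alternative allele) tuples, and a variant is kept when at least a chosen number of callers report it. The function name, the data layout, and the default of two supporting callers are assumptions for illustration; published pipelines differ in how they normalize variant representation and how many votes they require.

```python
from collections import Counter

def consensus_calls(callsets, min_callers=2):
    """Variants reported by at least `min_callers` callers.

    callsets: dict mapping caller name -> iterable of (chrom, pos, ref, alt).
    """
    votes = Counter(v for calls in callsets.values() for v in set(calls))
    return {variant for variant, n in votes.items() if n >= min_callers}
```

Indels in particular should be left-aligned and normalized before voting, since callers may represent the same event at different coordinates; that step is not shown here.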

Machine learning–based callers determine the somatic status of a variant from multiple features of the variant and therefore offer a higher level of flexibility than rule-based filtering strategies, especially for tumor samples with intratumor heterogeneity and normal tissue admixture. However, detailed curation of a set of ground-truth training data, including both true-positive and true-negative variants, is the key to optimizing and refining the training models.

Mutation Calling and Prioritization on Tumor Sample Without Matched Normal Sample

In large-scale cancer genomic projects, it is common to have tumor samples without matched normal samples or with tumor-contaminated adjacent normal samples, because of the difficulty of collecting patients’ blood samples. In these cases, somatic variant calling often has a high rate of false positives, because it is almost impossible to confidently determine whether a called variant is of germline origin or somatically acquired. Mutect2 can call somatic mutations in tumor-only mode; however, the calling results require careful filtering for false positives because the corresponding germline information is missing. Common germline SNPs can be eliminated by filtering against appropriate human genome variation databases such as the Genome Aggregation Database (gnomAD). To date, only a limited number of studies have compared the performance of the Mutect2 tumor-only and tumor/normal calling modes when both tumor and normal WGS/WES data are available. A tool designed specifically for somatic mutation calling on tumor-only WES samples is ISOWN [37], which uses a family of supervised learning classifiers to distinguish somatic SNVs in NGS data from SNPs in the absence of normal samples. In terms of performance, the F1-measure of ISOWN is between 75.9% and 98.6% across different cancer types, cell lines, fresh frozen tissues, and formalin-fixed paraffin-embedded tissues. Calling somatic variants in tumor-only WGS/WES data still warrants further improvement.

Due to these challenges, one can consider focusing on identifying putatively pathogenic variants in a set of genes of interest to specific tumors, irrespective of their germline or somatic origin (Fig. 3.2). Specifically, after basic variant quality filtering, such as keeping variants with a higher alternative allele count (>5) and VAF (>20%) and excluding those located in regions of low complexity or regions with extreme GC content, additional filters on variant class and population frequency can be applied, i.e., keeping only protein-altering variants with minor allele frequency <0.01 in population frequency databases such as 1000 Genomes [38] and gnomAD [39]. In addition, optional filters can be added to increase the calling confidence, such as keeping any variants that are present in the COSMIC catalog of somatic mutations or missense variants with a REVEL score >0.5 [36, 40].
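The filtering cascade described above can be written down as two small predicates in Python. The cutoffs (alternative allele count >5, VAF >20%, population frequency <0.01, REVEL >0.5) come from the text; the dictionary keys, the function names, and the exact set of variant classes treated as protein-altering are assumptions of this sketch, and the low-complexity and GC-content exclusions are omitted because they require region annotations.

```python
# Assumed labels for protein-altering variant classes (illustrative only)
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "inframe_indel", "splice_site"}

def keep_variant(v, min_alt_count=5, min_vaf=0.20, max_population_af=0.01):
    """Basic quality, variant-class, and population-frequency filter.

    v: dict with hypothetical keys 'alt_count', 'depth', 'variant_class',
    and 'population_af' (highest frequency across the databases consulted).
    """
    vaf = v["alt_count"] / v["depth"]
    return (v["alt_count"] > min_alt_count
            and vaf > min_vaf
            and v["variant_class"] in PROTEIN_ALTERING
            and v["population_af"] < max_population_af)

def high_confidence(v, min_revel=0.5):
    """Optional filter: listed in COSMIC, or missense with REVEL above the cutoff."""
    revel = v.get("revel")
    return v.get("in_cosmic", False) or (
        v["variant_class"] == "missense" and revel is not None and revel > min_revel)
```

Keeping the two predicates separate mirrors the text: the first set of filters is always applied, whereas the second is optional and trades sensitivity for confidence.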

Fig. 3.2

The workflow of the variant calling of tumor sample without a matched normal sample. The workflow focuses on reporting potentially pathogenic variants regardless of their tumor or germline origin

Germline Mutation Calling and Prioritization

Identifying germline mutations in cancer predisposition genes has important implications for understanding tumorigenesis and guiding clinical practice. A common germline mutation calling workflow is illustrated in Fig. 3.3a. The recommended germline variant calling follows the GATK best practices, including read mapping, alignment sorting, duplicate read marking, and variant calling by GATK HaplotypeCaller [7]. Joint variant calling across multiple germline samples is also recommended whenever possible, because genotype information at the population level can be leveraged to rescue a variant at a site with low coverage or lower quality in a given sample. The efficiency of GATK calling can be enhanced by a divide-and-conquer strategy, i.e., splitting the genome into multiple small chunks for parallel variant calling followed by merging the output variant files (VCFs). After variant calling, the GATK Variant Quality Score Recalibration (VQSR) method is the suggested approach to filter the germline variants. VQSR relies on a machine learning method (a Gaussian mixture model) and therefore requires a sufficient number of variant sites to establish a reliable training model. The number of variants in a single-sample WGS is usually sufficient for VQSR; however, for WES data, at least 30 samples are required to perform VQSR. When the sample size is limited, the variant call set can be filtered with the GATK VariantFiltration tool.
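The divide-and-conquer step amounts to generating non-overlapping genomic intervals that together cover every contig. A minimal Python sketch, with a hypothetical function name and 1-based inclusive coordinates as an assumed convention:

```python
def split_into_intervals(contig_lengths, chunk_size):
    """Yield (contig, start, end) intervals, 1-based inclusive, covering each contig.

    contig_lengths: dict mapping contig name -> length in bases.
    """
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            yield contig, start, min(start + chunk_size - 1, length)
```

Each interval can then be formatted as `contig:start-end` and handed to a separate calling job, with the per-interval VCFs merged afterwards. In practice, chunk boundaries are often placed in gaps or low-complexity regions so that no variant straddles two chunks; that refinement is not shown.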

Fig. 3.3

The workflow of germline variant calling and prioritization. (a) Steps of the joint calling of germline variants from pooled germline BAM files. (b) Steps of filtering and prioritizing potentially pathogenic germline variants or variants of unknown significance

To narrow down the vast number of germline variants reported by a germline variant caller, usually only rare, non-silent coding variants in cancer-related genes, such as autosomal dominant or autosomal recessive cancer-predisposition genes, or genes that are recurrently mutated in tumors, are considered. For example, Zhang et al. evaluated germline mutations in a cohort of pediatric cancers in a curated list of 565 cancer-related genes based on expert reviews of genes from the American College of Medical Genetics and Genomics (ACMG) and genes from the related literature [41]. Specifically, after germline variant calling, QC-passed variants are shortlisted based on their frequencies in human populations, such that only novel variants or variants with minor allele frequency <0.001 in the NHLBI Exome Sequencing Project (ESP) are kept [42]. These shortlisted variants can then be ranked based on (1) mutational class, such as nonsense SNVs, missense SNVs, splice site SNVs, frameshift indels, or in-frame indels; (2) functional annotation resources such as PolyPhen2 and MutationAssessor [43, 44]; (3) matches to curated variant pathogenicity databases such as NCBI ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) and locus-specific databases such as IARC TP53 (https://p53.iarc.fr/) and BRCA Exchange (https://brcaexchange.org/); and (4) a second hit on the intact copy in the tumor genome due to one-copy loss or promoter methylation of the intact copy. Other popular resources for germline variant classification and prioritization include pLI and LOFTEE scores for loss-of-function variant prioritization [39, 45]; REVEL and CADD scores for missense variant prioritization [40, 46]; and dbscSNV scores for splice variant prioritization [47]. In addition, InterVar, a tool for automated interpretation of variants based on dozens of criteria laid out by ACMG and the Association for Molecular Pathology (AMP), can be included to aid manual review of clinical significance [48].
Figure 3.3b summarizes the filtering steps to prioritize the germline variants to be reported. The final ranked list of putatively pathogenic germline variants then needs to be manually reviewed and validated based on phenotype data, RNA-seq, and literature review. The whole prioritization process before manual review can be automated. For example, St. Jude Pediatric Cancer Variant Pathogenicity Information Exchange (PeCan PIE, https://pecan.stjude.cloud/pie), a free cloud service for non-commercial use, offers a variant annotation and ranking service based on the MedalCeremony pipeline to triage germline variants into three categories: Gold, Silver, and Bronze [41, 49].

Variant Annotation

To understand the context of germline variants and somatic mutations, several tools are available to annotate the called variants. Typically, the genomic locations of the variants are compared against a gene-based annotation database such as a GENCODE release (https://www.gencodegenes.org/pages/data_access.html) to determine whether a variant is exonic, intronic, or intergenic [50]. Variants in exonic regions are further classified as missense variants, nonsense variants, silent variants, splice acceptor variants, splice donor variants, splice region variants, in-frame indels, and frameshift indels. Some annotation tools such as ANNOVAR [51], VEP [52], and SnpEff [53] also add population allele frequencies from the 1000 Genomes Project [38], NHLBI ESP [42], the Exome Aggregation Consortium (ExAC) [45], and gnomAD [39]; provide comparative genomics-based scores such as GERP++ [54], SIFT [55], and PolyPhen2 [43]; and include machine learning–based pathogenicity scores such as CADD [46, 56] and REVEL [40].

ANNOVAR [51] is an annotation pipeline to functionally annotate variants. The workflow can be performed for either gene-based coding change annotations or region-based non-genic genomic element annotations. Moreover, ANNOVAR has extended functionality to identify and filter variants documented in specific databases, which can be used for enriching causal variants in diseases. ANNOVAR allows the annotation of SNVs and structural variants from a standard VCF. A web interface is available via wANNOVAR (http://wannovar.wglab.org/).

VEP [52] is another popular toolkit for variant annotation. Compared to ANNOVAR, VEP provides cell-line-based annotation. VEP generates transcript-level annotations, while ANNOVAR gives gene-level annotations. LOFTEE (Loss-Of-Function Transcript Effect Estimator, https://github.com/konradjk/loftee) is a very useful VEP plugin to evaluate the loss of function of splice variants [39]. VEP also allows the variant annotation of species other than human and mouse. In addition to local installation, users can perform annotations through the VEP web interface (https://uswest.ensembl.org/info/docs/tools/vep/online/index.html) or a cloud virtual machine.

SnpEff [53] implements an interval forest algorithm to efficiently query, annotate, and predict the effect of the variants. SnpEff can run locally or via a Galaxy instance. Similar to VEP, SnpEff also provides a cloud VM for users. SnpEff allows the assessment of nonsense mediated decay (NMD), a functionality absent from ANNOVAR and VEP.

Contributing Factors for Bogus Somatic Variant Calling

Somatic variant calls generated by variant callers often include false positives arising from various contributing factors. Below we describe four common scenarios that cause bogus somatic variant calls and need to be considered in postprocessing.

Strand Bias

Strand bias is observed when reads are preferentially sequenced from one strand over the other; in extreme cases, only one strand of the DNA is covered by reads. The sources of this type of artifact remain elusive but may be related to library preparation or analytic procedures [57]. This bias raises concerns about variant call accuracy. GATK and Samtools both implement functionality to calculate strand bias scores.
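One common way to quantify strand bias is Fisher's exact test on the 2×2 table of reference/alternative reads by forward/reverse strand, reported as a Phred-scaled p-value. The Python sketch below implements the test with the standard library only. It illustrates the statistic and is not the code used by GATK or Samtools; the function names and the floor applied to the p-value before taking the logarithm are assumptions.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    p = sum(px for px in map(prob, range(lowest, highest + 1))
            if px <= p_observed * (1 + 1e-9))
    return min(1.0, p)

def strand_bias_phred(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled p-value; larger values indicate stronger strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10 * math.log10(max(p, 1e-300))  # floor avoids log10(0)
```

A variant whose alternative reads all come from one strand while the reference reads are balanced receives a high score and becomes a candidate for filtering. The cutoff is a project-level decision and is not prescribed here.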

Repetitive DNA Sequences

Repetitive DNA sequences are sequences that are identical or similar across the genome. They vary in size and frequency and cause mapping ambiguities. RepeatMasker [58] can be used to mark or mask the repetitive sequences in the genome to reduce such ambiguities. The error rate of short-read sequencing has been shown to increase in genomic regions with high or low GC content or with long homopolymer runs [59]. GC-rich regions also frequently suffer from low coverage. Segmental duplications can cause some reads to map to multiple places in the genome and give rise to unusual coverage. A BLAT (BLAST-like alignment tool, available at http://genome.ucsc.edu/cgi-bin/hgBlat) search can be used to determine whether the flanking sequence of a variant with high coverage maps uniquely to one locus or to multiple loci. Variants whose flanking sequence maps to multiple loci in the genome should be reviewed manually.

Variants in simple repeats or homopolymer regions, such as CCCCCCCC or ACGACGACGACG ([ACG]n), often lead to false-positive variant calls due to sequencing errors and subsequent read misalignments. Indels in repetitive regions that also have low alternative read count support are usually filtered out. However, frameshift indels in disease-causing genes (e.g., ATRX, PMS2) require careful visual inspection and perhaps validation with an orthogonal sequencing approach to avoid missing important findings.
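Flagging such sequence contexts is straightforward once the reference sequence flanking a variant is available. The Python sketch below detects homopolymer runs and short tandem repeats with regular expressions. The function names, the maximum repeat unit of 3 bases, and the minimum of 4 tandem copies are illustrative assumptions, and how the flanking sequence is extracted from the reference is left out.

```python
import re

def longest_homopolymer(sequence):
    """Length of the longest run of a single repeated base (0 for empty input)."""
    runs = re.finditer(r"(.)\1*", sequence.upper())
    return max((len(m.group(0)) for m in runs), default=0)

def in_simple_repeat(sequence, unit_max=3, min_copies=4):
    """True if a unit of 1..unit_max bases occurs at least min_copies times in tandem."""
    pattern = re.compile(r"([ACGT]{1,%d})\1{%d,}" % (unit_max, min_copies - 1))
    return pattern.search(sequence.upper()) is not None
```

For the two examples in the text, `longest_homopolymer("CCCCCCCC")` is 8 and `in_simple_repeat("ACGACGACGACG")` is true.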

Low-Frequency Variants

VAF is the number of reads supporting the alternative allele divided by the total number of reads covering the genomic location. For germline samples, a heterozygous germline variant would have a VAF of approximately 50%. Germline variants with a significantly lower VAF and a low alternative read count could be due to sequencing errors. Germline variants with sufficient alternative and total read counts but a low VAF may indicate mosaicism [60]. If a large number of germline variants have low VAFs, the normal sample may be contaminated by the tumor sample, which sometimes happens when the normal sample is collected as tissue adjacent to the tumor or as blood after treatment. Paralogous mapping can also lead to VAFs ranging from 10% to 25%.

Somatic mutations, on the other hand, exhibit a broader range of VAFs. A heterozygous somatic mutation in a copy-intact region would have an approximately 50% VAF. However, since tumor genomes are frequently subject to copy number alteration, the VAF of a somatic mutation could be around 33% or 67% due to one copy gain and could be close to 100% because of LOH. In addition, since patient tumor samples are rarely 100% pure, low tumor purity may further contribute to the global dilution of VAFs of somatic mutations in a tumor genome. Mutations with significantly lower VAFs than the truncal mutations in a tumor genome but with sufficient mutant read counts may suggest that they are subclonal. Somatic mutations with significantly low VAF and few alternative allele read counts could be due to sequencing error/artifacts and are recommended to be filtered out.
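The expected VAF of a clonal somatic mutation under a given tumor purity and copy number state follows from counting mutant and total allele copies in the tumor-normal mixture: VAF = p·m / (p·C_T + (1 - p)·C_N), where p is the tumor purity, m the number of mutant copies per tumor cell, C_T the total tumor copy number at the locus, and C_N the normal copy number (2 for autosomes). The Python sketch below evaluates this expression; the function name is hypothetical, and the formula assumes that the mutation is present in every tumor cell and absent from normal cells.

```python
def expected_vaf(purity, mutant_copies, tumor_copy_number, normal_copy_number=2):
    """Expected VAF of a clonal somatic mutation in a tumor-normal mixture."""
    mutant_alleles = purity * mutant_copies
    total_alleles = purity * tumor_copy_number + (1 - purity) * normal_copy_number
    return mutant_alleles / total_alleles
```

At 100% purity this reproduces the values in the text: 50% for a heterozygous mutation in a copy-intact region, about 33% or 67% after one copy gain depending on which allele was gained, and 100% under LOH. At 50% purity, the same heterozygous mutation in a copy-intact region is expected at 25%.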

Germline Variant Contamination

A few somatic SNV callers, e.g., Mutect, have implemented specific filters to eliminate potential germline variant contamination in somatic variant calling. Mutect allows the inclusion of a panel of normal samples (PON) and the dbSNP database to exclude germline variants. Germline variant contamination can also be reduced by checking the minor allele frequencies of mutations across population frequency databases such as gnomAD and the 1000 Genomes Project database. A recent study [61] reported one germline SNP in a somatic SNV prediction set of median size (4325 somatic SNVs); the study also reported a negative correlation between germline SNP contamination and tumor purity.

Concluding Notes

Somatic variant calling from WGS/WES is critical for cancer genomics as it not only depicts the mutational landscape for a tumor sample but also serves as input data for downstream analyses such as mutational signature and clonal evolution. Consequently, there has been great interest in developing fast, accurate, and scalable methodologies and tools for variant calling across academia and industry. In addition to the tools mentioned above, there are also other variant calling tools acting on different data types and different platforms as described below.

Mitochondria Mutation Calling

Variants present in the mitochondrial genome (mtDNA) are implicated in a wide spectrum of human disorders and diseases with highly divergent phenotypes and penetrance. The challenges of mtDNA variant calling arise from the circular topology of mtDNA as well as the homology between mtDNA and parts of the nuclear genome of mitochondrial origin (NUMTs). The mtDNA mutation load also varies greatly among tissues and organs, from heteroplasmy (a mixture of mutant and wild-type copies, <100%) to homoplasmy (100%). The Human Mitochondrial Genome Database, Mitomap [62], provides a repertoire of reported mtDNA variants. Nuclear genome variant callers such as VarScan and LoFreq have been used for identifying somatic mtDNA variants [63, 64]. MitoCaller [65] of the MitoAnalyzer toolkit was designed specifically to infer the mutation status of each position of the mitochondrial genome using likelihood-based models, and it adopted an iterative alignment strategy to account for the circularity of the mtDNA genome. Importantly, discrepancies in mtDNA variant calling have been reported when using different reference genomes and enrichment strategies [64], which should be taken into consideration when performing mtDNA variant calling and interpretation.
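To illustrate why circularity matters, the sketch below shows one commonly used workaround, which is not the iterative alignment strategy of MitoCaller: reads are aligned a second time against a copy of the reference rotated by a fixed shift, so that reads spanning the artificial breakpoint of the linear reference map contiguously, and the resulting positions are then converted back. The function names and the shift of 8000 bases are assumptions for illustration; 16569 is the length of the revised Cambridge Reference Sequence.

```python
MT_LENGTH = 16569  # length of the rCRS mitochondrial reference, in bases


def shifted_to_original(pos_shifted, shift, length=MT_LENGTH):
    """Convert a 1-based position on a rotated reference to original coordinates.

    The rotated reference is assumed to start at original position shift + 1.
    """
    if not 1 <= pos_shifted <= length:
        raise ValueError("position outside the reference")
    return (pos_shifted - 1 + shift) % length + 1


def original_to_shifted(pos_original, shift, length=MT_LENGTH):
    """Inverse of shifted_to_original."""
    if not 1 <= pos_original <= length:
        raise ValueError("position outside the reference")
    return (pos_original - 1 - shift) % length + 1


if __name__ == "__main__":
    shift = 8000  # hypothetical rotation
    # The last and first original bases are adjacent on the rotated reference,
    # so a read covering both no longer has to be split.
    print(original_to_shifted(MT_LENGTH, shift), original_to_shifted(1, shift))
```

With this rotation, original positions 16569 and 1 land on neighboring coordinates in the middle of the rotated reference, far from either end of the linear sequence.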

Long-Read Variant Calling

While short reads from paired-end sequencing are used by most state-of-the-art SNV callers to accurately detect variation in diploid genomes, they provide only limited haplotype information, which is required by some SNV callers such as GATK HaplotypeCaller and FreeBayes. In addition, the accurate calling of SNVs in repetitive regions of the human genome remains a challenge. Third-generation sequencing (TGS) technologies, including Pacific Biosciences (PacBio) and Oxford Nanopore (ONT), have the potential to overcome these limitations of short-read sequencing. Nevertheless, compared to short-read sequencing, long-read sequencing usually costs more and generates less accurate reads (e.g., sporadic indels in ONT data), posing challenges for accurate variant detection [66]. Current SNV callers using TGS data are mostly designed for germline variant calling and are usually optimized on publicly available data from the Genome in a Bottle (GIAB) Consortium. Somatic SNV calling based on long-read technology is still underdeveloped.

NGS-based mapping tools such as BWA-MEM are not suitable for long-read mapping. Instead, new mapping tools such as Minimap2 [67] and NGMLR [68] have been developed specifically for long reads. Similarly, NGS-based SNV calling tools such as GATK HaplotypeCaller and FreeBayes are not recommended for variant calling on long-read sequencing data. Instead, several variant callers have been developed specifically for long-read data. They leverage the haplotype information available in long reads to improve the accuracy of calling and phasing SNVs in diploid genomes, and to map variants in duplicated regions of the genome that cannot be mapped using short reads. For example, Longshot [66] takes advantage of the haplotype information present in PacBio long reads to improve SNV calling accuracy [69]. WhatsHap [69] introduces a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, which takes full advantage of the linkage information provided by PacBio long reads. Clairvoyante [70] uses a multi-task five-layer convolutional neural network model to predict variants. Other tools include DeepVariant for variant calling on PacBio data [12] and MarginPhase (https://github.com/benedictpaten/marginPhase) for simultaneous haplotyping and genotyping on Oxford Nanopore data.

These tools differ in their precision and recall. In a benchmark study using PacBio data from GIAB, three callers (Longshot, WhatsHap, and Clairvoyante) demonstrated very similar performance [66]. Relative to these three tools, MarginPhase performed moderately when evaluated on GIAB high-confidence regions [69]. Another tool, HELLO [71], integrates short-read and long-read data to improve the robustness of SNV calling by leveraging the Mixture of Experts paradigm, which uses an ensemble of deep neural networks (DNNs).
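For readers less familiar with these metrics, precision and recall against a truth set can be computed as sketched below. This is a simplified illustration that matches variants by exact (chromosome, position, reference, alternate) identity; dedicated benchmarking tools additionally normalize variant representation and restrict the comparison to high-confidence regions, which this sketch does not attempt. The function name and the toy variants are hypothetical.

```python
def benchmark_calls(called, truth):
    """Compare a call set with a truth set of (chrom, pos, ref, alt) tuples."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    false_positives = len(called - truth)
    false_negatives = len(truth - called)
    # Precision: fraction of calls that are real. Recall: fraction of real
    # variants that were called. Both default to 0.0 for empty denominators.
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"tp": true_positives, "fp": false_positives, "fn": false_negatives,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy truth set and call set with made-up coordinates.
    truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 300, "G", "A"), ("chr2", 400, "T", "C")]
    called = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
              ("chr3", 500, "A", "C")]
    print(benchmark_calls(called, truth))
```

In this toy example, two of three calls are correct (precision 2/3) and two of four true variants are recovered (recall 1/2).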

Variant Calling in Single-Cell Data

Single-cell sequencing has become a hotspot of functional genomics for elucidating the heterogeneity of cell compositions. Variant calling on single-cell data can aid the inference of lineage relationships among cells. Although large-scale single-cell WGS/WES currently remains challenging in terms of experimental design complexity and sequencing cost, single-cell RNA sequencing (scRNA-seq) has been applied broadly to examine cell population dynamics and track the development of cell lineages. The preprocessing steps for scRNA-seq data are similar to the usual practice for WGS/WES calling. However, splicing-aware aligners, e.g., STAR [72] or GSNAP [73], are suggested for read alignment. There are still not many callers designed specifically for single-cell data [74]. The Trinity Cancer Transcriptome Analysis Toolkit (CTAT) is one caller with extended functionality for scRNA-seq SNV detection. SCIΦ is another tool, which performs joint calling of mutations in individual cells followed by an estimation of the tumor phylogeny [75]. SSrGE [76] is an integrative workflow that connects genotype and phenotype in single-cell data and implements GATK best practices and FreeBayes for variant inference. A few other studies used the SAMtools mpileup approach for variant identification [77, 78]. Robust variant calling strategies for single-cell data will be in great need in the coming years.