Abstract
Next generation sequencing (NGS) has traditionally been performed in various fields including agricultural to clinical and there are so many sequencing platforms available in order to obtain accurate and consistent results. However, these platforms showed amplification bias when facilitating variant calls in personal genomes. Here, we sequenced whole genomes and whole exomes from ten Korean individuals using Illumina and Ion Proton, respectively to find the vulnerability and accuracy of NGS platform in the GC rich/poor area. Overall, a total of 1013 Gb reads from Illumina and ~39.1 Gb reads from Ion Proton were analyzed using BWA-GATK variant calling pipeline. Furthermore, conjunction with the VQSR tool and detailed filtering strategies, we achieved high-quality variants. Finally, each of the ten variants from Illumina only, Ion Proton only, and intersection was selected for Sanger validation. The validation results revealed that Illumina platform showed higher accuracy than Ion Proton. The described filtering methods are advantageous for large population-based whole genome studies designed to identify common and rare variations associated with complex diseases.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
Introduction
Genome-wide sequencing aims to provide a deep understanding of sequence variations as a foundation for investigating the relationship between genotype and phenotype. Thus, the 1000 genomes project (http://www.1000genomes.org) has yielded important insights into the study of population genetics, including the investigation of causal variants of genes for various human syndromes (Rusmini et al. 2016). In recent years, next-generation sequencing (NGS) technology is evolving rapidly and several new sequencing platforms were released. Whole-genome sequencing (WGS) has comprehensively been used for detecting genomic and genetic variations such as single-nucleotide variants (SNVs) (Kim et al. 2009; Wang et al. 2008; Zong et al. 2012), copy-number variations (CNVs) (Conrad et al. 2010; Sebat et al. 2004; Wong et al. 2007; Zong et al. 2012), insertions and deletions (InDels) (Abyzov et al. 2015; Pleasance et al. 2010; Sudmant et al. 2015), and chromosomal rearrangements (Tsai and Lieber 2010). However, the low cost, high sequencing coverage and more readily interpreting protein-coding exons associated with whole-exome sequencing (WES) platform when compared to the WGS platform, has led to more exome sequencing than genome sequencing (Gonzaga-Jauregui et al. 2012).
In 2013, most WES studies are performed on Illumina’s HiSeq while the Thermo Fisher’s SOLiD and Roche 454 platforms account for a small share of activities. The development of semiconductor-based sequencing has recently emerged as an alternate sequencing platform with the commercial release of the personal genome machine (PGM) by life technologies (Gonzaga-Jauregui et al. 2012; Merriman et al. 2012; Quail et al. 2012; van Dijk et al. 2014). It has potential to be cost-effective with a rapid turnaround time, perhaps faster than other current methods that rely on optical imaging and specialized fluorescent-labeled nucleotides. Although the PGM produces megabases (Mbs) of DNA sequence within hours, its net output is insufficient to generate WES data with adequately high coverage per base. Later, gigabases (Gbs) of data has been acquired, thus enabling investigators to conduct high quality WES, transcriptomes and, eventually, WGS rapidly at low cost after Thermo Fisher released the semiconductor sequencing instrument, Ion Proton.
Previously, the variant calling performance of the Ion Proton sequencer was assessed by comparing it with variants called by HiSeq and MiSeq sequencers (Boland et al. 2013; Motoike et al. 2014; Quail et al. 2012; Zhang et al. 2015). In this study, we evaluated the performance of the Ion Proton semiconductor-based sequencer for WES and Illumina HiSeq 2500 for WGS based on the accuracy of the variation data and the GC rich/poor regions of the Korean individual genomes. The variants called by these platforms are validated with Sanger sequencing. In addition, 59 genes belonging to the American College of Medical Genetics and Genomics (ACMG) were included.
Materials and methods
Sample collection and DNA preparation
The samples used in our study came from ten Korean male and female individuals collected from KPGP (Korean Personal Genome Project). Blood samples were collected and genomic DNA (gDNA) extracted using DNeasy Blood & Tissue Kit (Qiagen, CA). DNA was quality-checked on agarose gels and quantified using a micro volume spectrophotometer (NanoDrop 2000).
Exome capture and sequencing with Ion Proton
Exome capture for all ten samples was carried out using a commercially available in-solution method (Ion AmpliSeq Exome Kit), following the manufacturer’s guidelines. The gDNA (50 ng) was used for library preparation using the Ion AmpliSeq Library Kit 2.0 kit (Thermo Fisher Scientific, USA). The DNA templates for sequencing were prepared on the Ion Chef system using the Ion PI™ Hi-Q™ Chef Kit. Sequencing was carried out on the Ion Proton™ (Thermo Fisher, Carlsbad, California, USA) at TheragenEtex, South Korea.
Whole-genome sequencing with HiSeq2500
After quality control to ensure lack of genomic degradation, DNA samples (1 μg) were subjected to library preparation. Libraries with short inserts of 350–450 bp for paired-end reads were prepared using Truseq DNA sample prep kit following the manufacturer’s protocol for Illumina. The whole genome DNA was sequenced using Illumina HiSeq 2500, including an adaptation of the pairwise end-sequencing strategy.
Post-sequencing analysis (read alignment, coverage, variant calling)
Burrows-Wheeler aligner [BWA,v0.6.2; (Li and Durbin 2009)] was used to align the sequencing reads, with default parameters, to the Genome Reference Consortium assembly, GRCh37. Alignments were converted from sequence alignment map (SAM) format to sorted, indexed binary alignment map (BAM) files (SAMtools v0.1.18). The Picard tool was used to remove duplicate reads. We used the BEDtools (v2.26.0) to intersect the aligned genomic regions overlapping between WGS and WES BAM files (Quinlan and Hall 2010). Genome Analysis Tool Kit (GATK) software tools (v2.3) were used for improvement of alignments and genotype calling and refining with default parameters (DePristo et al. 2011). Genotypes were called by the GATK UnifiedGenotyper, and the GATK VariantRecalibrator tool was used to score variant calls by a machine-learning algorithm and to identify a set of high-quality SNPs using the Variant Quality Score Recalibration (VQSR) procedure with default parameters (McKenna et al. 2010). In addition, all the variants from both platforms were functionally annotated using databases from Korean Center for Disease Control (KCDC), American College of Medical Genetics (ACMG), Exome Aggregation Consortium (ExAC), 1000 Genomes Project, Exome Variant Server (6500), and GC content in order to identify common variations. In the last filtering step, markers fall below 40% and above 60% GC content were selected for subsequent downstream analysis. The whole analysis process was shown in Fig. 1.
Identification of GC rich/poor regions using RefGene
The coding regions corresponding to each transcript were counted using RefGene. The GC ratio of each gene was calculated by removing overlapping transcripts and counting number of G and C base in the exon (coding) regions (Rieber et al. 2013). The GC distribution plot was drawn using R package (Fig. 2).
Validation of variants by Sanger sequencing
Variants for Sanger sequencing validation were randomly selected from the following filtering procedure: filter column ⇒ pass, ANN.Imapct ⇒ morderate, rsID ⇒ yes, exac : more than average, 1000 Genomes: Caucasian, African high frequency (more than average) filter out, GC_contents ⇒ high (maximum) & low (minimum), KCDC DB frequency ⇒ 0.3–1 homo (10%), hetero (90%, 0/1). PCR primers were designed using Primer 3 software, to produce amplicons (ranging in size from 100 to 200 bp) containing variants of interest in approximately the center of the amplicon. PCR was performed in a total volume of 10 μl containing 5 μl AmpliTaq Gold®Fast PCR Master Mix (Applied Biosystems), 1.5 μl of each primer with a concentration of 0.5 pmol/μl, and 2 μl genomic DNA in a concentration of 40 ng/μl. After quality-control steps using agarose gel, the product was purified and pooled. Final PCR products were quantified using Qubit (Invitrogen, USA), followed by conventional Sanger sequencing was performed.
Results
Sequencing, coverage, and alignment statistics
In this study, ten Korean individual samples were subjected to high-throughput sequencing using two different platforms, in order to identify the sensitivity and completeness of variants. Whole-genome sequencing (WGS) was performed with 101 bp paired-end sequences to ~30X coverage using Illumina HiSeq 2500 and whole-exome sequencing (WES) was performed by exon capture with 170 bp to a ~100-fold coverage using Ion Proton sequencer as described in materials and methods. A summary of the Illumina and Ion Proton derived data for ten Korean individuals is shown in Table 1. We aligned reads from both platforms to the human reference GRCh37, henceforth referred to as hg19. Overall, a total of 10,130,712,476 reads from Illumina and 391,868,380 reads from Ion Proton, were mapped to the hg19 using the software, BWA (Li and Durbin 2009). For all samples, on average 95% of the reads could be mapped to the human reference genome (hg19) (Table 1). Analysis of raw binary alignment map (BAM) files revealed that sequence reads covered more than 94 and 89% of the whole genome for Illumina and Ion Proton, respectively, with a depth of greater than ten reads per base.
Variant calling (Illumina vs. Ion Proton)
We sought to compare the sensitivity and accuracy of each platform for SNV calling. The overlapping regions between two different sequencing data (WGS and WES) were computed by using BEDtools (Quinlan and Hall 2010) implicated in the further variant analysis. Software suites, such as the GATK (McKenna et al. 2010), have been designed to manage large-scale sequencing projects. We used GATK UnifiedGenotyper tool to call multiple variants for genetic variants from both platforms. Targeted realignment and base recalibration were performed using the GATK tools as explained in “Materials and methods”. Multiple variant calling for ten samples produced 54,766 and 90,057 SNVs from Illumina and Ion Proton, respectively (Supplementary file S1 and S2). As a result of variant analysis with two different sequencing methods, considering read coverage and mapping quality, we found more SNVs in WES data using Ion Proton because variant calling with unusable and limited read coverage of data were eliminated more from WGS data compared with WES data holding high-genome coverage. GATK’s best practices include a variant filtering step following Variant Quality Score Recalibration (VQSR). This “VQSR filter” uses annotation metrics, such as quality by depth, mapping quality, variant position within reads, and strand bias, from “true” variants to generate an adaptive error model. It then applies this model to the remaining variants for calculating a probability that each variant is real. Using VQSR, we further filtered out multiple raw variants. Hence, VQSR filter count produced 49,979 (91.2%) and 73,417 (81.5%) SNVs from Illumina and Ion Proton, respectively (Fig. 3). These SNVs were further clustered based on depth coverage (>10) and we found that 8.24 and 3.0% SNVs were filtered out from Illumina and Ion Proton, respectively.
Functional annotation of variants
We used various databases including KCDC, ACMG, ExAC, 1000 Genomes Project, and Exome Variant Server (6500), in order to provide accurate functional annotations of variants. Based on the results, we applied following filtering options: “VQSR = “PASS”, 1000 Genome Asian 0.3<, KCDC 0.3<, GC region > 40, 60< (Fig. 2), Impact: High, Moderate” and identified that 532 SNVs were shown as intersection between Illumina (707 SNVs) and Ion Proton (737 SNVs) multiple variants (Fig. 4; Supplementary file S3). In addition, we found that the number of SNVs was drastically reduced because the GC regions between 40 and 60% were filtered out.
Marker filtering and experimental validation of SNVs
To investigate the accuracy of each sequencing platform, we selected ten variants from Illumina only, Ion Proton only, and common from both platforms using the following filtering options: filter coloum ⇒ pass, ANN.Imapct ⇒ morderate, rsID ⇒ yes, exac: more than average, 1000 Genomes: Caucasian, African high frequency (more than average) filter out, GC contents ⇒ high (maximum) & low (minimum), KCDC DB frequency ⇒ 0.3–1 homo (10%), hetero (90%, 0/1). We have validated each of the ten variants that were obtained from WGS and WES using Sanger sequencing. Our objective of Sanger sequencing validation was to confirm the filtered variants in order to identify heterozygosity status of variants. We designed a total of 30 primer pairs for Illumina, Ion Proton and common variants to amplify each of the ten variants from the coding region of different genes (Supplementary file S4). When comparing the results of common variants and Sanger sequencing, we could not find any false-positive. It was due to achieving optimum sequencing depth for both WGS and WES. Compared to the human reference genome (hg19), we found that 98 and 94% of the Illumina and Ion Proton variants, respectively, showed exact matches in Sanger sequencing validation (Fig. 5). In case of the SNP in RNF222 (G>A), homozygous variant (A/A) was found in 7, whereas heterozygous variant (A/C) was observed in 3 of the 10 Proton samples. On the other hand, in case of the SNP in KRTAP10-1 (G>A), eight samples showed homozygous and two exhibited heterozygous variant (G/T) (Fig. 5; Table 2). In summary, Sanger sequencing validation revealed that the Illumina showed more accuracy than the Ion Proton platform.
Discussion
To gain fast and cost-effective NGS sequencing and analysis of the human genome, it is important that sampling and library preparation steps must be simplified, but also to carry out DNA sequencing on a single technology platform. Moreover, with the availability of a high quality reference genome, resequencing with “low-cost per base” technologies have been accomplished by many commercial sequencing companies. However, with hundreds of human genomes currently being deciphered, the question of whether variant filtering may provide the accuracy of alternative alleles or not. Our analysis on the accuracy of SNVs suggests that a significant number of variants display higher accuracy when the data compared from two sequencing platforms.
We realize that all the NGS service providers may postulate “best practice” protocol from sample-prep to the bioinformatics pipeline, and that include different analyzing strategy. Further, platform related issues, coverage related issues, and SNP related issues (false-positives and quality) may hindered the process of variant calling. In this study we included BWA + GATK and VQSR for variant calling in our pipeline. Though it’s a best practice following worldwide, we focused on criteria based on GC rich/poor regions. Those variants falling between 40–60% GC regions were omitted and 707 and 737 variants were finally selected from Illumina and Ion Proton data set, respectively, of ten Korean individuals. In addition, using a Sanger sequencing approach we report that the validation of SNP calls is significantly improved for variants that are reported from WGS and WES platforms.
Conclusions
In this study, we have examined the accuracy of variant calling using Illumina’s HiSeq2500 and Thermo Fisher’s Ion Proton technologies. We showed the strengths and weaknesses of each technology with respect to multiple SNV calling sensitivity and specificity through detailed filtering strategies. We significantly extended earlier filtering options for variant calling by including GC rich and poor regions. Sanger sequencing validation revealed 98 and 94% accuracy for Illumina and Ion Proton platforms, respectively. Therefore, our filtering methods might be useful for large scale identification of common and rare variants associated with complex diseases.
References
Abyzov A, Li S, Kim DR, Mohiyuddin M, Stutz AM, Parrish NF, Mu XJ, Clark W, Chen K, Hurles M et al (2015) Analysis of deletion breakpoints from 1,092 humans reveals details of mutation mechanisms. Nat Commun 6:7256
Boland JF, Chung CC, Roberson D, Mitchell J, Zhang X, Im KM, He J, Chanock SJ, Yeager M, Dean M (2013) The new sequencer on the block: comparison of Life Technology’s Proton sequencer to an Illumina HiSeq for whole-exome sequencing. Hum Genet 132:1153–1163
Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P et al (2010) Origins and functional impact of copy number variation in the human genome. Nature 464:704–712
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M et al (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491–498
Gonzaga-Jauregui C, Lupski JR, Gibbs RA (2012) Human genome sequencing in health and disease. Annu Rev Med 63:35–61
Kim JI, Ju YS, Park H, Kim S, Lee S, Yi JH, Mudge J, Miller NA, Hong D, Bell CJ et al (2009) A highly annotated whole-genome sequence of a Korean individual. Nature 460:1011–1015
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M et al (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303
Merriman B, Ion Torrent R, Team D, Rothberg JM (2012) Progress in ion torrent semiconductor chip based sequencing. Electrophoresis 33:3397–3417
Motoike IN, Matsumoto M, Danjoh I, Katsuoka F, Kojima K, Nariai N, Sato Y, Yamaguchi-Kabata Y, Ito S, Kudo H et al (2014) Validation of multiple single nucleotide variation calls by additional exome analysis with a semiconductor sequencer to supplement data of whole-genome sequencing of a human population. BMC Genom 15:673
Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD, Varela I, Lin ML, Ordonez GR, Bignell GR et al (2010) A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463:191–196
Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y (2012) A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genom 13:341
Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842
Rieber N, Zapatka M, Lasitschka B, Jones D, Northcott P, Hutter B, Jager N, Kool M, Taylor M, Lichter P et al (2013) Coverage bias and sensitivity of variant calling for four whole-genome sequencing technologies. PLoS ONE 8:e66621
Rusmini M, Federici S, Caroli F, Grossi A, Baldi M, Obici L, Insalaco A, Tommasini A, Caorsi R, Gallo E et al (2016) Next-generation sequencing and its initial applications for molecular diagnosis of systemic auto-inflammatory diseases. Ann Rheum Dis 75:1550–1557
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M et al (2004) Large-scale copy number polymorphism in the human genome. Science 305:525–528
Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Hsi-Yang Fritz M et al (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526:75–81
Tsai AG, Lieber MR (2010) Mechanisms of chromosomal rearrangement in the human genome. BMC Genom 11(Suppl 1):S1
van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C (2014) Ten years of next-generation sequencing technology. Trends Genet 30:418–426
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J et al (2008) The diploid genome sequence of an Asian individual. Nature 456:60–65
Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE et al (2007) A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet 80:91–104
Zhang G, Wang J, Yang J, Li W, Deng Y, Li J, Huang J, Hu S, Zhang B (2015) Comparison and evaluation of two exome capture kits and sequencing platforms for variant calling. BMC Genom 16:581
Zong C, Lu S, Chapman AR, Xie XS (2012) Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338:1622–1626
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
Young Ju Ahn declares that he has no conflict of interest. Kesavan Markkandan declares that he has no conflict of interest. In-Pyo Baek declares that he has no conflict of interest. Seyoung Mun declares that he has no conflict of interest. Wooseok Lee declares that he has no conflict of interest. Heui-Soo Kim declares that he has no conflict of interest. Kyudong Han declares that he has no conflict of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Ahn, Y.J., Markkandan, K., Baek, IP. et al. An efficient and tunable parameter to improve variant calling for whole genome and exome sequencing data. Genes Genom 40, 39–47 (2018). https://doi.org/10.1007/s13258-017-0608-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13258-017-0608-6