Introduction

Genome-wide sequencing aims to provide a deep understanding of sequence variations as a foundation for investigating the relationship between genotype and phenotype. Thus, the 1000 genomes project (http://www.1000genomes.org) has yielded important insights into the study of population genetics, including the investigation of causal variants of genes for various human syndromes (Rusmini et al. 2016). In recent years, next-generation sequencing (NGS) technology is evolving rapidly and several new sequencing platforms were released. Whole-genome sequencing (WGS) has comprehensively been used for detecting genomic and genetic variations such as single-nucleotide variants (SNVs) (Kim et al. 2009; Wang et al. 2008; Zong et al. 2012), copy-number variations (CNVs) (Conrad et al. 2010; Sebat et al. 2004; Wong et al. 2007; Zong et al. 2012), insertions and deletions (InDels) (Abyzov et al. 2015; Pleasance et al. 2010; Sudmant et al. 2015), and chromosomal rearrangements (Tsai and Lieber 2010). However, the low cost, high sequencing coverage and more readily interpreting protein-coding exons associated with whole-exome sequencing (WES) platform when compared to the WGS platform, has led to more exome sequencing than genome sequencing (Gonzaga-Jauregui et al. 2012).

In 2013, most WES studies are performed on Illumina’s HiSeq while the Thermo Fisher’s SOLiD and Roche 454 platforms account for a small share of activities. The development of semiconductor-based sequencing has recently emerged as an alternate sequencing platform with the commercial release of the personal genome machine (PGM) by life technologies (Gonzaga-Jauregui et al. 2012; Merriman et al. 2012; Quail et al. 2012; van Dijk et al. 2014). It has potential to be cost-effective with a rapid turnaround time, perhaps faster than other current methods that rely on optical imaging and specialized fluorescent-labeled nucleotides. Although the PGM produces megabases (Mbs) of DNA sequence within hours, its net output is insufficient to generate WES data with adequately high coverage per base. Later, gigabases (Gbs) of data has been acquired, thus enabling investigators to conduct high quality WES, transcriptomes and, eventually, WGS rapidly at low cost after Thermo Fisher released the semiconductor sequencing instrument, Ion Proton.

Previously, the variant calling performance of the Ion Proton sequencer was assessed by comparing it with variants called by HiSeq and MiSeq sequencers (Boland et al. 2013; Motoike et al. 2014; Quail et al. 2012; Zhang et al. 2015). In this study, we evaluated the performance of the Ion Proton semiconductor-based sequencer for WES and Illumina HiSeq 2500 for WGS based on the accuracy of the variation data and the GC rich/poor regions of the Korean individual genomes. The variants called by these platforms are validated with Sanger sequencing. In addition, 59 genes belonging to the American College of Medical Genetics and Genomics (ACMG) were included.

Materials and methods

Sample collection and DNA preparation

The samples used in our study came from ten Korean male and female individuals collected from KPGP (Korean Personal Genome Project). Blood samples were collected and genomic DNA (gDNA) extracted using DNeasy Blood & Tissue Kit (Qiagen, CA). DNA was quality-checked on agarose gels and quantified using a micro volume spectrophotometer (NanoDrop 2000).

Exome capture and sequencing with Ion Proton

Exome capture for all ten samples was carried out using a commercially available in-solution method (Ion AmpliSeq Exome Kit), following the manufacturer’s guidelines. The gDNA (50 ng) was used for library preparation using the Ion AmpliSeq Library Kit 2.0 kit (Thermo Fisher Scientific, USA). The DNA templates for sequencing were prepared on the Ion Chef system using the Ion PI™ Hi-Q™ Chef Kit. Sequencing was carried out on the Ion Proton™ (Thermo Fisher, Carlsbad, California, USA) at TheragenEtex, South Korea.

Whole-genome sequencing with HiSeq2500

After quality control to ensure lack of genomic degradation, DNA samples (1 μg) were subjected to library preparation. Libraries with short inserts of 350–450 bp for paired-end reads were prepared using Truseq DNA sample prep kit following the manufacturer’s protocol for Illumina. The whole genome DNA was sequenced using Illumina HiSeq 2500, including an adaptation of the pairwise end-sequencing strategy.

Post-sequencing analysis (read alignment, coverage, variant calling)

Burrows-Wheeler aligner [BWA,v0.6.2; (Li and Durbin 2009)] was used to align the sequencing reads, with default parameters, to the Genome Reference Consortium assembly, GRCh37. Alignments were converted from sequence alignment map (SAM) format to sorted, indexed binary alignment map (BAM) files (SAMtools v0.1.18). The Picard tool was used to remove duplicate reads. We used the BEDtools (v2.26.0) to intersect the aligned genomic regions overlapping between WGS and WES BAM files (Quinlan and Hall 2010). Genome Analysis Tool Kit (GATK) software tools (v2.3) were used for improvement of alignments and genotype calling and refining with default parameters (DePristo et al. 2011). Genotypes were called by the GATK UnifiedGenotyper, and the GATK VariantRecalibrator tool was used to score variant calls by a machine-learning algorithm and to identify a set of high-quality SNPs using the Variant Quality Score Recalibration (VQSR) procedure with default parameters (McKenna et al. 2010). In addition, all the variants from both platforms were functionally annotated using databases from Korean Center for Disease Control (KCDC), American College of Medical Genetics (ACMG), Exome Aggregation Consortium (ExAC), 1000 Genomes Project, Exome Variant Server (6500), and GC content in order to identify common variations. In the last filtering step, markers fall below 40% and above 60% GC content were selected for subsequent downstream analysis. The whole analysis process was shown in Fig. 1.

Fig. 1
figure 1

Flowchart summarizing the process applied for the identification of variants based on the Illumina and Ion Proton sequencing data. BWA Burrows-Wheeler Aligner; GATK Genome Analysis Toolkit

Identification of GC rich/poor regions using RefGene

The coding regions corresponding to each transcript were counted using RefGene. The GC ratio of each gene was calculated by removing overlapping transcripts and counting number of G and C base in the exon (coding) regions (Rieber et al. 2013). The GC distribution plot was drawn using R package (Fig. 2).

Fig. 2
figure 2

Frequency histogram of GC contents from the VQSR filtered data. The shadow region marked in-between 40 and 60 were omitted during analysis. X-axis the total GC contents in each transcript; Y-axis the number of G or C base corresponding to RefGene

Validation of variants by Sanger sequencing

Variants for Sanger sequencing validation were randomly selected from the following filtering procedure: filter column ⇒ pass, ANN.Imapct ⇒ morderate, rsID ⇒ yes, exac : more than average, 1000 Genomes: Caucasian, African high frequency (more than average) filter out, GC_contents ⇒ high (maximum) & low (minimum), KCDC DB frequency ⇒ 0.3–1 homo (10%), hetero (90%, 0/1). PCR primers were designed using Primer 3 software, to produce amplicons (ranging in size from 100 to 200 bp) containing variants of interest in approximately the center of the amplicon. PCR was performed in a total volume of 10 μl containing 5 μl AmpliTaq Gold®Fast PCR Master Mix (Applied Biosystems), 1.5 μl of each primer with a concentration of 0.5 pmol/μl, and 2 μl genomic DNA in a concentration of 40 ng/μl. After quality-control steps using agarose gel, the product was purified and pooled. Final PCR products were quantified using Qubit (Invitrogen, USA), followed by conventional Sanger sequencing was performed.

Results

Sequencing, coverage, and alignment statistics

In this study, ten Korean individual samples were subjected to high-throughput sequencing using two different platforms, in order to identify the sensitivity and completeness of variants. Whole-genome sequencing (WGS) was performed with 101 bp paired-end sequences to ~30X coverage using Illumina HiSeq 2500 and whole-exome sequencing (WES) was performed by exon capture with 170 bp to a ~100-fold coverage using Ion Proton sequencer as described in materials and methods. A summary of the Illumina and Ion Proton derived data for ten Korean individuals is shown in Table 1. We aligned reads from both platforms to the human reference GRCh37, henceforth referred to as hg19. Overall, a total of 10,130,712,476 reads from Illumina and 391,868,380 reads from Ion Proton, were mapped to the hg19 using the software, BWA (Li and Durbin 2009). For all samples, on average 95% of the reads could be mapped to the human reference genome (hg19) (Table 1). Analysis of raw binary alignment map (BAM) files revealed that sequence reads covered more than 94 and 89% of the whole genome for Illumina and Ion Proton, respectively, with a depth of greater than ten reads per base.

Table 1 Overview of basic statistics from Illumina (bold) and Ion Proton (italic) sequencing platforms

Variant calling (Illumina vs. Ion Proton)

We sought to compare the sensitivity and accuracy of each platform for SNV calling. The overlapping regions between two different sequencing data (WGS and WES) were computed by using BEDtools (Quinlan and Hall 2010) implicated in the further variant analysis. Software suites, such as the GATK (McKenna et al. 2010), have been designed to manage large-scale sequencing projects. We used GATK UnifiedGenotyper tool to call multiple variants for genetic variants from both platforms. Targeted realignment and base recalibration were performed using the GATK tools as explained in “Materials and methods”. Multiple variant calling for ten samples produced 54,766 and 90,057 SNVs from Illumina and Ion Proton, respectively (Supplementary file S1 and S2). As a result of variant analysis with two different sequencing methods, considering read coverage and mapping quality, we found more SNVs in WES data using Ion Proton because variant calling with unusable and limited read coverage of data were eliminated more from WGS data compared with WES data holding high-genome coverage. GATK’s best practices include a variant filtering step following Variant Quality Score Recalibration (VQSR). This “VQSR filter” uses annotation metrics, such as quality by depth, mapping quality, variant position within reads, and strand bias, from “true” variants to generate an adaptive error model. It then applies this model to the remaining variants for calculating a probability that each variant is real. Using VQSR, we further filtered out multiple raw variants. Hence, VQSR filter count produced 49,979 (91.2%) and 73,417 (81.5%) SNVs from Illumina and Ion Proton, respectively (Fig. 3). These SNVs were further clustered based on depth coverage (>10) and we found that 8.24 and 3.0% SNVs were filtered out from Illumina and Ion Proton, respectively.

Fig. 3
figure 3

The number of variants per chromosome identified from Illumina and Ion Proton platforms

Functional annotation of variants

We used various databases including KCDC, ACMG, ExAC, 1000 Genomes Project, and Exome Variant Server (6500), in order to provide accurate functional annotations of variants. Based on the results, we applied following filtering options: “VQSR = “PASS”, 1000 Genome Asian 0.3<, KCDC 0.3<, GC region > 40, 60< (Fig. 2), Impact: High, Moderate” and identified that 532 SNVs were shown as intersection between Illumina (707 SNVs) and Ion Proton (737 SNVs) multiple variants (Fig. 4; Supplementary file S3). In addition, we found that the number of SNVs was drastically reduced because the GC regions between 40 and 60% were filtered out.

Fig. 4
figure 4

Number of variants called from detailed filtering criteria. The Venn diagram depicted the number of variants identified from GC rich/poor regions after filtering. A total of 707 and 737 variants were identified from Illumina and Ion Proton data set, respectively, comprising 532 common variants

Marker filtering and experimental validation of SNVs

To investigate the accuracy of each sequencing platform, we selected ten variants from Illumina only, Ion Proton only, and common from both platforms using the following filtering options: filter coloum ⇒ pass, ANN.Imapct ⇒ morderate, rsID ⇒ yes, exac: more than average, 1000 Genomes: Caucasian, African high frequency (more than average) filter out, GC contents ⇒ high (maximum) & low (minimum), KCDC DB frequency ⇒ 0.3–1 homo (10%), hetero (90%, 0/1). We have validated each of the ten variants that were obtained from WGS and WES using Sanger sequencing. Our objective of Sanger sequencing validation was to confirm the filtered variants in order to identify heterozygosity status of variants. We designed a total of 30 primer pairs for Illumina, Ion Proton and common variants to amplify each of the ten variants from the coding region of different genes (Supplementary file S4). When comparing the results of common variants and Sanger sequencing, we could not find any false-positive. It was due to achieving optimum sequencing depth for both WGS and WES. Compared to the human reference genome (hg19), we found that 98 and 94% of the Illumina and Ion Proton variants, respectively, showed exact matches in Sanger sequencing validation (Fig. 5). In case of the SNP in RNF222 (G>A), homozygous variant (A/A) was found in 7, whereas heterozygous variant (A/C) was observed in 3 of the 10 Proton samples. On the other hand, in case of the SNP in KRTAP10-1 (G>A), eight samples showed homozygous and two exhibited heterozygous variant (G/T) (Fig. 5; Table 2). In summary, Sanger sequencing validation revealed that the Illumina showed more accuracy than the Ion Proton platform.

Fig. 5
figure 5

Ion Proton and Illumina derived variant verification by Sanger sequencing. a Ion Proton (rs12601265), b Ion Proton (rs233319), and c Illumina (rs233316)

Table 2 Results for the final 10 prioritized variants regarding Illumina, Ion Proton, and intersection validated through Sanger sequencing

Discussion

To gain fast and cost-effective NGS sequencing and analysis of the human genome, it is important that sampling and library preparation steps must be simplified, but also to carry out DNA sequencing on a single technology platform. Moreover, with the availability of a high quality reference genome, resequencing with “low-cost per base” technologies have been accomplished by many commercial sequencing companies. However, with hundreds of human genomes currently being deciphered, the question of whether variant filtering may provide the accuracy of alternative alleles or not. Our analysis on the accuracy of SNVs suggests that a significant number of variants display higher accuracy when the data compared from two sequencing platforms.

We realize that all the NGS service providers may postulate “best practice” protocol from sample-prep to the bioinformatics pipeline, and that include different analyzing strategy. Further, platform related issues, coverage related issues, and SNP related issues (false-positives and quality) may hindered the process of variant calling. In this study we included BWA + GATK and VQSR for variant calling in our pipeline. Though it’s a best practice following worldwide, we focused on criteria based on GC rich/poor regions. Those variants falling between 40–60% GC regions were omitted and 707 and 737 variants were finally selected from Illumina and Ion Proton data set, respectively, of ten Korean individuals. In addition, using a Sanger sequencing approach we report that the validation of SNP calls is significantly improved for variants that are reported from WGS and WES platforms.

Conclusions

In this study, we have examined the accuracy of variant calling using Illumina’s HiSeq2500 and Thermo Fisher’s Ion Proton technologies. We showed the strengths and weaknesses of each technology with respect to multiple SNV calling sensitivity and specificity through detailed filtering strategies. We significantly extended earlier filtering options for variant calling by including GC rich and poor regions. Sanger sequencing validation revealed 98 and 94% accuracy for Illumina and Ion Proton platforms, respectively. Therefore, our filtering methods might be useful for large scale identification of common and rare variants associated with complex diseases.