
1 Introduction

The advent of high-throughput technology has revolutionized the biological sciences over the last two decades, enabling experiments at the whole-genome scale. Data from such large-scale experiments are interpreted at the systems level to understand the interplay among the genome, transcriptome, epigenome, proteome, metabolome, and regulome. This has enhanced our ability to study disease systems and to relate molecular data to clinical and epidemiological data, as well as to habits, diet, and environment. A disproportionate amount of data has been generated in the last 5 years on disease genomes, especially from tumor tissues of different subsites, using high-throughput sequencing (HTS) instruments. Before elaborating on the use of HTS technology in generating cancer-related data, it is important to briefly describe the history of DNA sequencing and the revolution brought about by the second and third generations of DNA sequencers that resulted in much of today’s data deluge.

2 Sequencing Revolution

The history of DNA sequencing goes back to the late 1970s, when Maxam and Gilbert [1] and Sanger, Nicklen, and Coulson [2] independently showed that a stretch of DNA can be sequenced either by a chemical modification method or by a chain-termination method using di-deoxy nucleotides, respectively. Maxam and Gilbert’s method did not gain popularity because it relied on toxic chemicals, and the di-deoxy chain-termination method proposed by Fred Sanger became the de facto standard and the method of choice for researchers working in the field of DNA sequencing. Many of the present-day high-throughput next-generation sequencing methods (described later) use the same principle of sequencing-by-synthesis originally proposed by Sanger. The pace, ease, and automation of the process have since grown further with the advent of PCR and other incremental, yet significant, advances, including the introduction of high-fidelity, nearly error-free enzymes, the use of modified nucleotides, and better optical detection devices. It is essentially the same technology, first proposed and used by Fred Sanger [2], that, with modifications, led to the completion of the first draft of the Human Genome Project [3, 4] and ushered in a new era of DNA sequencing.

The idea behind some of the first generation of high-throughput sequencing (HTS) assays was to take a known chemistry (predominantly Sanger’s sequencing-by-synthesis chemistry) and parallelize the assay to read hundreds of millions of growing DNA chains rather than the tens or hundreds read by capillary Sanger sequencing. HTS workflows comprise four main steps: template preparation, sequencing, image capture, and data analysis (Fig. 1). Different HTS platforms use different template preparation methods, sequencing chemistries, and imaging technologies, which result in differences in throughput, accuracy, and running costs among platforms. As most imaging systems are not designed to detect single fluorescent events, clonal amplification of templates is incorporated as part of template preparation before optical reading of the signal. In some cases, as in single-molecule sequencing, templates are not amplified but are read directly to give base-level information. Some platforms are better suited than others for certain types of biological applications [5]. For the discovery of actionable variants in tumors, accuracy is more important than all other parameters; therefore, some HTS platforms are better suited than others for studying tumor genomes. However, as the cost per base goes down, accuracy is increasingly achieved through higher coverage, compensating for errors with a larger number of overlapping reads. The first publications on human genome resequencing using HTS appeared in 2008, using pyrosequencing [6] and sequencing-by-synthesis with reversible terminator chemistry [7]. Since then, the field that has gained the most from HTS platforms is cancer science. The discovery of novel DNA sequence variants in multiple cancer types using HTS platforms, along with advances in analytical methods, has given us tools that have the potential to change the way cancer is currently diagnosed, treated, and managed.

Fig. 1 Steps involved in HTS assays involving cancer patient samples and variant discovery, validation, and interpretation

3 Primary Data Generation in Cancer Studies

The various steps involved in a typical high-throughput experiment on cancer tissue are depicted in Fig. 1. Any study involving human subjects must be preapproved by an institutional review/ethics board, with informed consent from all participants. Briefly, when a patient is admitted to the hospital, clinical and epidemiological information, together with the patient’s full history, habits, and any previous diagnosis and treatment, is recorded, and the relevant analytes are collected. The patient then undergoes treatment (surgery/chemoradiation), and the tumor tissue is collected and stored appropriately until further use. Once the tumor, adjacent normal tissue, and/or blood are collected, nucleic acids are isolated, checked for quality, and used for library/target preparation for HTS or microarray experiments. Once the raw data are collected, they are analyzed by computational and statistical means and then integrated with clinical and epidemiological features to arrive at a set of biomarkers, which is then validated in a larger cohort of patients.

4 High-Throughput Data

HTS platforms generate terabytes of data per instrument per run. For example, the Illumina HiSeq 4000 can generate nearly 3 terabytes of data per run in 7 days (or >400 Gb of data per day). This poses challenges for data storage, analysis, sharing, interpretation, and archiving.

Although there are many different HTS instruments on the market, the bulk of cancer data so far has been generated using Illumina’s sequencing-by-synthesis chemistry. Therefore, a detailed description is provided of the data sizes, types, and complexity involved in cancer data generated by Illumina instruments. Below is a description of the different data types usually produced during the course of a high-throughput cancer discovery study.

Although the process of high-throughput data generation on Illumina sequencing instruments has become streamlined, there are inherent limitations on the quality of the data generated. These include a relatively high error rate in sequencing reads (leading some clinical test providers to sequence to 1000× coverage or more per nucleotide to attain the requisite accuracy), short read lengths (the HiSeq series of instruments does not produce reads longer than 150 nt), the inability of the assay to interrogate low-complexity regions of the genome, and a high per-sample cost (to attain the requisite accuracy, one needs to spend thousands of dollars per sample even for a small gene panel test). Details of the different data types generated by Illumina HiSeq instruments, their approximate sizes, and file type descriptions are provided in Table 1; a simple coverage calculation illustrating the 1000× figure is sketched after the table.

Table 1 Different types of cancer data generated from a typical Illumina sequencing instrument and their descriptions.
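To put the coverage figures quoted above in perspective, the short calculation below estimates how many reads a targeted panel needs to reach a given mean depth. It is a minimal sketch with illustrative assumptions only (a hypothetical 1 Mb panel, 150 nt reads, 1000× target depth, and rough on-target and duplication rates), not platform specifications.

```python
# Back-of-the-envelope coverage estimate for a targeted sequencing panel.
# All numbers below are illustrative assumptions, not platform specifications.

def reads_required(target_bp: int, depth: float, read_len: int,
                   on_target: float = 0.7, duplication: float = 0.1) -> int:
    """Estimate the number of reads needed to reach a mean depth over a target.

    depth * target_bp gives the total on-target bases required; this is then
    inflated to account for off-target reads and PCR/optical duplicates.
    """
    usable_fraction = on_target * (1.0 - duplication)
    total_bases_needed = depth * target_bp / usable_fraction
    return int(total_bases_needed / read_len)

if __name__ == "__main__":
    # Hypothetical 1 Mb gene panel sequenced to 1000x with 150 nt reads.
    n_reads = reads_required(target_bp=1_000_000, depth=1000, read_len=150)
    print(f"~{n_reads:,} reads (~{n_reads / 1e6:.1f} million)")
    # Roughly ten million reads under these assumptions -- a small fraction of
    # one HiSeq lane, which is why accuracy, not raw throughput, drives the
    # per-sample cost of deep panel sequencing.
```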

5 Primary Data Analysis

The cancer data analysis schema is represented in Fig. 2. First, the raw image data from the sequencing instruments are converted into FASTQ format, which constitutes the primary data files for all subsequent analysis. Before analysis, the quality of the FASTQ files is checked using tools like FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or in-house scripts to reduce sequencing quality-related bias in subsequent analysis steps. Next, sequencing reads are aligned against a reference sequence. Broadly, alignment tools fall into two major classes depending on the indexing algorithm used: hash table-based or Burrows–Wheeler transform (BWT)-based. Commonly used hash table-based aligners include BFAST [8], SSAHA [9], SMALT [10], Stampy [11], and Novoalign [12]; BWT-based aligners include Bowtie [13], Bowtie2 [14], and BWA [15]. BWA is the aligner most widely used by the research community. Lately, many alignment programs have been parallelized to gain speed [16–19]. Most aligners report their results in the Sequence Alignment/Map (SAM, or its binary form, BAM) format [20], which stores different flags associated with each aligned read. Before the aligned files are processed for calling single-nucleotide variants (SNVs), insertions/deletions (indels), copy number variants (CNVs), and other structural variants (SVs), a few filtering and quality-control steps are performed on the SAM/BAM files. These include removal of duplicate reads and reads mapped to multiple locations in the genome, local realignment around known indels, and recalibration of base quality scores with respect to known SNVs. Once the SAM/BAM files pass these checks, they are used for variant calling. Although there are multiple variant-calling tools, the most widely used is the Genome Analysis Toolkit (GATK) [21, 22] developed at the Broad Institute, USA. GATK implements variant quality score recalibration and posterior probability calculations to minimize the false positive rate in the pool of called variants [22]. Variants are stored in the variant call format (VCF), which is used by various secondary and tertiary analysis tools. Another commonly used file format in cancer data analysis is the mutation annotation format (MAF), initially devised for data from The Cancer Genome Atlas (TCGA) consortium. A MAF file lists all the mutations and stores considerably more information about the variants and alignments than a VCF file. A minimal command-line sketch of such an alignment and variant-calling workflow is given after Fig. 2.

Fig. 2 Cancer data analysis schema
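As a concrete illustration of the primary analysis steps above, the sketch below strings one possible FASTQ-to-VCF pass together from the command line. It is a minimal sketch only: it assumes BWA, samtools, and GATK4 are installed and on the PATH, all file names (reference.fa, sample_R1.fastq.gz, and so on) are placeholders, and it uses the germline HaplotypeCaller for brevity; a tumor/normal somatic workflow would instead use a paired somatic caller and add base quality score recalibration.

```python
"""Minimal FASTQ -> BAM -> VCF sketch (illustrative only).

Assumes bwa, samtools, and gatk (GATK4) are installed; all file names are
placeholders to be replaced with real paths.
"""
import subprocess

REF = "reference.fa"                       # indexed with `bwa index` and `samtools faidx`
R1, R2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"
RG = r"@RG\tID:sample\tSM:sample\tPL:ILLUMINA"   # minimal read group required by GATK

def run(cmd: str) -> None:
    """Run a shell command, stopping the pipeline on the first failure."""
    print(f"[pipeline] {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# 1. Align reads to the reference and produce a coordinate-sorted, indexed BAM.
run(f"bwa mem -t 8 -R '{RG}' {REF} {R1} {R2} | samtools sort -o sample.sorted.bam -")
run("samtools index sample.sorted.bam")

# 2. Mark PCR/optical duplicates before variant calling.
run("gatk MarkDuplicates -I sample.sorted.bam -O sample.dedup.bam "
    "-M sample.dup_metrics.txt")
run("samtools index sample.dedup.bam")

# 3. Call variants; the resulting VCF feeds the secondary analysis steps.
#    (Somatic pipelines would use a paired tumor/normal caller here.)
run(f"gatk HaplotypeCaller -R {REF} -I sample.dedup.bam -O sample.vcf.gz")
```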

6 Secondary Data Analysis

Before secondary analysis, the PASS variants produced by GATK (standard call confidence >= 50) that fall within the specific genomic bait regions (used in exome or gene panels) are usually selected for further use. Tumor-specific variants are detected by filtering out the variants found in the corresponding/paired normal sample. During this process, sequencing reads supporting a variant in the tumor sample that have no corresponding reads at the same location in the matched normal are ignored (using the callable filter of the variant caller); thus, only variants at positions covered by sequencing reads in both the tumor and its matched normal sample are considered. Common SNPs (found in the normal population) are then filtered out using lists of variants from databases such as dbSNP and the 1000 Genomes Project. Optimization workflows have been designed to analytically assess the best combination of tools (both alignment and variant calling) to increase the sensitivity of variant detection [23]. The sensitivity of alignment and variant-calling tools is usually assessed with a set of metrics such as aligner- and variant caller-specific base quality plots of the called variants, transition/transversion (Ti/Tv) ratios, and the SNP rediscovery rate using microarrays [23]. Cross-contamination in tumor samples is further assessed using tools like ContEst [24]. Searching variants against known cancer-specific variants in databases like COSMIC [25–27] is the first step in determining whether a variant/gene is novel or has been found previously in the same or other cancer types. There are cancer-specific tools for annotation and functional analysis; the common annotation tools are ANNOVAR [28] and VEP [29]. CRAVAT [30] provides predictive scores for different types of variants (both somatic and germline) along with annotations from published literature and databases, and uses a cancer-specific database when run with the CHASM analysis option. Genes whose CHASM scores pass a chosen cutoff are considered significant and carried forward for comparison with other functional analyses. IntoGen [31], MutSigCV [32], and MuSiC2 [33] are other tools used for annotation and functional analysis of somatic variants.
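The subtraction logic described above can be illustrated with a short script. The sketch below treats variants as (chromosome, position, ref, alt) tuples parsed from plain VCF files and removes those seen in the matched normal or in a common-SNP list; it is a simplified illustration only (real pipelines use dedicated somatic callers and consider read-level evidence, allele frequencies, and callable regions), and the file names are placeholders.

```python
"""Toy somatic-variant filtering: tumor VCF minus matched-normal and common SNPs.

Simplified illustration; production workflows use dedicated paired
tumor/normal somatic callers rather than naive set subtraction.
"""
from typing import Set, Tuple

Variant = Tuple[str, int, str, str]            # (chrom, pos, ref, alt)

def load_variants(vcf_path: str) -> Set[Variant]:
    """Read PASS variants from an uncompressed VCF into a set of tuples."""
    variants: Set[Variant] = set()
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue                        # skip header lines
            chrom, pos, _id, ref, alt, _qual, filt = line.split("\t")[:7]
            if filt in ("PASS", "."):
                for allele in alt.split(","):   # handle multi-allelic sites
                    variants.add((chrom, int(pos), ref, allele))
    return variants

if __name__ == "__main__":
    tumor = load_variants("tumor.vcf")          # placeholder file names
    normal = load_variants("normal.vcf")
    common_snps = load_variants("dbsnp_common.vcf")

    somatic = tumor - normal - common_snps      # candidate tumor-specific variants
    print(f"{len(somatic)} candidate somatic variants retained")
```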

7 Data Validation, Visualization, and Interpretation

Once annotated, the cancer-specific genes are validated in the same discovery set and in a larger set of validation samples. Validation is largely done using an orthogonal sequencing method/chemistry, mass spectrometry-based mutation detection methods, and/or Sanger sequencing. The validated variants are then mapped to pathways using tools like Graphite Web [34], which employs both topological and multivariate pathway analyses with an interactive network for data visualization. Once the network of genes is obtained, the interactions are drawn using tools like Cytoscape [35–37]. Variants can also be visualized using Circos [38], a cancer-specific portal like the cBio portal [39], or a viewer like the Integrative Genomics Viewer (IGV) [40]. Finally, the genes that are altered in a specific cancer tissue are validated by functional screening, using specific gene knockouts to understand their function and relationships with other genes.

8 High-Throughput Data on Human Cancers

Current projects in cancer genomics are aimed at producing a large amount of sequence information as primary output, together with information on variants (somatic mutations, insertions and deletions, copy number variations, and other structural variations in the genome). To analyze such large amounts of data, high-performance computing (HPC) clusters with large memory and storage capacity are required. Additionally, high-frequency, high-throughput multi-core processors, along with the ability to perform high-volume data analysis in memory, are often needed. Because of the sheer number of files that need to be processed, and not just their size, read/write capability is an important parameter for sequence analysis. For effective storage and analysis of sequencing data and related metadata, network-attached storage systems providing file-level access are recommended. Additionally, there is a need for an effective database organization for easy access, management, and updating of data. Several data portals have been developed, primarily by the large consortia. Prominent among them are: The Cancer Genome Atlas (TCGA, https://tcga-data.nci.nih.gov/tcga/) data portal; the cBio data portal [39] (developed at the Memorial Sloan-Kettering Cancer Center, http://www.cbioportal.org); the International Cancer Genome Consortium (ICGC) data portal (https://dcc.icgc.org); and the Sanger Institute’s Catalogue of Somatic Mutations in Cancer (COSMIC) database [25] portal (http://cancer.sanger.ac.uk/cosmic).

Although biological databases are built on many different platforms, the most common are MySQL and Oracle; MySQL is the more popular of the two because it is open source. Although the consortia-led efforts (like TCGA and ICGC) have resulted in large and comprehensive databases covering most cancer types, the sites are not always user-friendly and do not accept external data for integration and visualization. Therefore, efforts like the cBio portal (http://www.cbioportal.org) are required to integrate data and to provide user-friendly data search and retrieval. However, such efforts have to balance the cost and time required against the usability of, and additional value added by, the new database. The common databases use software systems known as relational database management systems (RDBMS), which use SQL (Structured Query Language) for querying and maintenance; MySQL is a widely used open-source RDBMS. Although most biological databases use MySQL or another RDBMS, these systems have limitations where large data are concerned. First, big data come in structured, semi-structured, and unstructured forms. Second, traditional SQL databases and other RDBMS lack the ability to scale out, a requirement for databases containing large amounts of data. Third, an RDBMS cannot scale out on inexpensive commodity hardware. All of this makes RDBMS less suitable for large data uses. This gap is primarily filled by NoSQL databases, document-oriented or graph databases that are non-relational, HPC-friendly, schema-less, and built to scale [41]. One of the important requirements for a database is the ability to accommodate future increases in data size and complexity (Fig. 3), that is, to scale in both of these dimensions. Although it is a good idea to choose databases that can scale out and accommodate the variety and volume of future data, for simplicity and ease of use most small labs stick with a MySQL database handling a variety of data, together with commonly used middleware, a web server, and a browser for data retrieval and visualization. A minimal relational schema for storing variant calls is sketched after Fig. 3.

Fig. 3 Two important parameters of big data and the place for an ideal database
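To make the RDBMS discussion concrete, the sketch below sets up a minimal relational schema for variant calls and runs a recurrence query. It uses Python's built-in sqlite3 module purely as a lightweight stand-in for MySQL or another RDBMS; the table layout, column names, and records are illustrative assumptions, not a schema used by TCGA, ICGC, or cBioPortal.

```python
"""Minimal relational schema for storing variant calls (illustrative only).

sqlite3 stands in here for MySQL or another RDBMS; the schema and the toy
records are hypothetical.
"""
import sqlite3

conn = sqlite3.connect(":memory:")              # throwaway in-memory database
conn.executescript("""
CREATE TABLE sample (
    sample_id   TEXT PRIMARY KEY,
    tumor_type  TEXT NOT NULL
);
CREATE TABLE variant (
    sample_id   TEXT REFERENCES sample(sample_id),
    gene        TEXT NOT NULL,
    chrom       TEXT NOT NULL,
    pos         INTEGER NOT NULL,
    ref         TEXT NOT NULL,
    alt         TEXT NOT NULL,
    effect      TEXT                            -- e.g. missense, nonsense
);
""")

# Toy records standing in for curated calls from a discovery cohort.
conn.executemany("INSERT INTO sample VALUES (?, ?)",
                 [("S1", "HNSCC"), ("S2", "HNSCC"), ("S3", "HNSCC")])
conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("S1", "TP53", "17", 7578406, "C", "T", "missense"),
                  ("S2", "TP53", "17", 7577539, "G", "A", "missense"),
                  ("S3", "NOTCH1", "9", 139399400, "C", "A", "nonsense")])

# Recurrence query: how many samples carry a variant in each gene?
for gene, n in conn.execute("""
        SELECT gene, COUNT(DISTINCT sample_id) AS n_samples
        FROM variant GROUP BY gene ORDER BY n_samples DESC"""):
    print(f"{gene}: mutated in {n} sample(s)")
```

The same schema would transfer directly to MySQL; the trade-off discussed above is that such a relational layout is simple to query but does not scale out across commodity hardware the way NoSQL stores do.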

9 Large-Scale Cancer Genome Projects

Advances in technology have fuelled interest in the cancer research community, resulting in several large, publicly funded, consortium-based efforts to catalogue changes in primary tumors of various types. Some of the notable efforts in this direction are The Cancer Genome Atlas (TCGA) project (http://www.cancergenome.nih.gov/), the International Cancer Genome Consortium (ICGC) project (https://icgc.org) [42], the Cancer Genome Project (http://www.sanger.ac.uk/genetics/CGP/), and the Therapeutically Applicable Research to Generate Effective Treatments project (http://target.cancer.gov/). The National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) of the USA launched TCGA as a pilot project in 2006, even before the first human resequencing work using HTS platforms was published. The TCGA effort aims to produce a comprehensive understanding of the molecular basis of cancer and has grown to include samples from more than 11,000 patients across 33 different cancer types. The ICGC is an international consortium that aims to obtain a comprehensive description of the various molecular changes (genomic, transcriptomic, and epigenomic) in 50 different tumor types and/or subtypes. ICGC currently has participants from 18 countries studying cancer samples from more than 12,000 donors across 21 tumor types. All these consortium projects are producing a substantial resource for the wider cancer research community.

To date, HTS data have been generated on several cancer types, and data analysis has confirmed the presence of somatic mutations in important genes, significant changes in gene/miRNA expression, hyper- and hypo-methylation of gene promoters, and structural variations in cancer genomes [32, 43–66]. Additionally, comparative analyses of different analytical tools for cancer data analysis have been published [23, 67–83]. Pan-cancer analysis projects have also identified regions specific to, and shared among, different cancer types [32, 65, 84–88].

10 Cancer Research-Specific Challenges

There are several challenges related to HTS assays using tumor tissues, and for an HTS assay to have clinical utility these must be overcome. The challenges can be clinical, technical, biological, statistical, regulatory, and market-related, and are outlined in Table 2.

Table 2 Challenges of making high-throughput assays, especially sequencing-based assays, meaningful in clinics

Clinical challenges: The first clinical challenge relates to sample quantity. For retrospective studies to be meaningful, assays must be robust enough to use nucleic acids derived from formalin-fixed paraffin-embedded (FFPE) tissues. The tissue sections extracted are often not large enough to yield sufficient quantities of nucleic acids for sequencing and validation studies, even with newer assays that require only tens of nanograms of starting material. Even when enough nucleic acid can be obtained from FFPE tissue, it is often of poor quality and fragmented. Additionally, chemical modifications such as cross-links and depurination, and the presence of certain impurities in FFPE-extracted DNA, make it less amenable to the manipulations required for high-throughput assays. FFPE-extracted DNA can therefore strongly affect the performance of HTS assays. Cancer tissues are heterogeneous [32, 89], and in certain cases extremely heterogeneous (for example, pancreatic adenocarcinoma), which cautions against overinterpreting HTS data from a single bulk of tumor tissue, as shown in metastatic renal-cell carcinoma [90]. In heterogeneous tumors, therefore, the mutational burden may be underestimated. Studying such intra-tumor heterogeneity may strengthen the case for combination therapeutic approaches in cancer [91]. Analytical methods have been devised to detect tumor heterogeneity [73, 92, 93].

Biological challenges: The next challenge is biological: finding somatic mutations, especially those present at very low frequency, against the sea of normal background is difficult. A matched normal sample is essential for finding somatic variants in cancer sequencing, but matched normal tissue can be hard to obtain; therefore, DNA from lymphocytes of the same patient is often used as the normal sample. Another problem in sequencing tumor tissue DNA is cross-contamination. Analytical tools have been developed to detect the level of cross-contamination in tumor tissues from both sequencing and array data [24, 94]. The best way to overcome both the heterogeneity and the cross-contamination issues is to sequence DNA/RNA derived from a single tumor cell. Single-cell genomics is likely to improve detection, monitoring of progression, and prediction of therapeutic efficacy in cancer [95]. Several reports have been published on single-cell sequencing of different cancers and on analytical tools for analyzing data from a single tumor cell [96–108]. Although single-cell sequencing overcomes the problem of heterogeneity, fundamental questions remain, namely how many single cells have to be sequenced and whether the signature differs between individual tumor cells. Additionally, there are limitations in the current protocols for isolating single tumor cells, and inaccuracies in whole-genome amplification of genomic DNA derived from a single cell. Capturing minute amounts of genetic material and amplifying it therefore remains one of the greatest challenges in single-cell genomics [109, 110].

Technical challenges: The third type of challenge relates to technical issues with the current generation of sequencing instruments. Depending on the instrument in use, there can be issues with high error rates, read length, homopolymer stretches, and GC-rich regions of the genome. Additionally, accurate sequencing and assembly of correct haplotype structures for certain regions of the genome, such as the human leukocyte antigen (HLA) region, are challenging because of the shorter read lengths generated by second-generation DNA sequencers, the presence of polymorphic exons and pseudogenes, and repeat-rich regions.

Statistical challenges: One of the biggest challenges in finding driver mutations in cancer relates to sample number. Discovering rare driver mutations is extremely challenging when sample numbers are not adequate. This so-called "long tail" phenomenon is common in many cancer genome sequencing studies. Discovering rare driver mutations (present at 2% frequency or lower) requires sequencing a large number of samples. For example, in head and neck cancer, imputations have shown that roughly 2000 tumor:normal pairs would have to be sequenced to achieve 90% power, in 90% of genes, to find somatic variants present at 2% frequency or higher [43]. A toy calculation illustrating why cohort size matters is sketched below.
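The following toy calculation illustrates the cohort-size effect with simple binomial arithmetic: it asks how likely a gene mutated in 2% of tumors is to be observed mutated in at least ten patients, an arbitrary recurrence bar chosen purely for illustration. It is only a sketch of the "long tail" intuition and is not the power model used in the cited head and neck cancer study, which also accounts for background mutation rates and formal significance testing.

```python
"""Toy 'long tail' illustration: chance of observing a rare driver recurrently.

Pure-stdlib binomial arithmetic; NOT the power model of the cited study,
which also models background mutation rates and statistical significance.
"""
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
                     for i in range(k))

if __name__ == "__main__":
    driver_freq = 0.02      # gene mutated in 2% of tumors (assumed)
    recurrence_bar = 10     # arbitrary bar for the gene to stand out from noise
    for cohort in (100, 500, 1000, 2000):
        p_seen = prob_at_least(recurrence_bar, cohort, driver_freq)
        print(f"cohort of {cohort:>4} tumors: "
              f"P(gene mutated in >= {recurrence_bar} patients) = {p_seen:.3f}")
```

The probability of observing such recurrence rises steeply with cohort size, which is the intuition behind the cohort-size estimates cited above.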

Regulatory and other challenges: For cancer personalized medicine to become a reality, proper regulatory and policy frameworks need to be in place. Issues around how to deal with germline changes need to be resolved, and strict assay and technical controls/standards are needed to assess biological, clinical, and technical accuracy and authenticity. A great beginning in this direction has already been made by the Genome in a Bottle Consortium (https://sites.stanford.edu/abms/giab), hosted by the National Institute of Standards and Technology of the USA, which has produced reference materials (reference standards, reference methods, and reference data) to be used in sequencing. Finally, for cutting-edge genomic tests to become a reality, collaboration and cooperation between academic centers and industry are absolutely necessary [111]. Additionally, acceptability criteria and proper pricing control mechanism(s) need to be put in place by governments. This is especially necessary for countries like India, where genomic tests are largely unregulated.

11 Conclusion

Cancer research has changed since the introduction of technologies like DNA microarrays and high-throughput sequencing. It is now possible to get a genome-wide view of a particular tumor rather than looking at a handful of genes. The biggest challenge in finding actionable variants in cancer remains at the level of data analysis and understanding their functional importance. Recent demonstrations [112–115] of gene editing systems like CRISPR-Cas9 for understanding the function of cancer-related genes and their role(s) in carcinogenesis and metastasis will play a big role in the future. Further, high-throughput sequencing technology can be used to provide information on an individual's cancer regulome by integrating information on genetic variants, transcript variants, regulatory proteins binding to DNA and RNA, DNA and protein methylation, and metabolites. Finally, for big data to bear fruit in cancer diagnosis, prognosis, and treatment, several elements need to be in place: simplified data analytics platforms; accurate sequencing chemistry; standards for measuring clinical accuracy, precision, and sensitivity; proper country-specific regulatory guidelines; and a stringent yet ethical framework against data misuse [111].