The big data revolution

Sequencing of the first human genome was completed in 2003, at a cost of almost three billion US dollars. In the 15 years that followed, the cost of whole genome sequencing has fallen remarkably, toward the well-known US$1000 target. This has been made possible for the most part through significant technological improvements and the implementation of next-generation sequencing (NGS). NGS platforms allow high-throughput and parallelisable DNA sequencing. These technologies generally utilise short read sequencing, followed by mapping of sequence reads against a reference genome in the analysis stage [reviewed in (Goodwin et al. 2016)]. Reductions in DNA sequencing costs have enabled the commencement of large-scale cancer genome sequencing projects. As cohort sizes have increased, data processing and storage requirements have necessarily become much more demanding. Hosting sequencing data for even a handful of whole human genomes requires hundreds of gigabytes of storage. Further, cancer genomics analyses often incorporate additional datasets from the fields of epigenomics and transcriptomics, thus increasing the complexity of such studies. Together, these factors have brought cancer genomics into the big data revolution.

The Cancer Genome Atlas (TCGA) project was launched in 2005 and recently completed, having produced sequencing data from tumour and matched normal tissues from more than 30 cancer types (Tomczak et al. 2015). The International Cancer Genome Consortium (ICGC) commenced in 2008, similarly seeking to whole-genome sequence thousands of cancer samples and provide the data for research access (Zhang et al. 2011). Many processed datasets from these projects are ‘open access’, and raw datasets are generally available after application, for researchers to download and analyse for their own genomics projects. For researchers without the significant computational infrastructure that can be necessary to download and process datasets of these sizes, the National Cancer Institute (NCI) has sponsored the development of three cloud resources, which can enable scientists to analyse and visualise large datasets in a cloud environment (Hinkson et al. 2017).

Driver mutations and cancer development

It has been known for many years that cancer develops as a result of chromosomal abnormalities, and the specific mutation profile of a tumour has important implications for cancer treatment (Nowell 1976). Mutations can develop in cellular DNA through exposure to external DNA-damaging agents or from internal deficiencies in DNA replication or repair (Vogelstein et al. 2013). These processes result in the accumulation of potentially hundreds of thousands of somatic mutations in a single cancer genome, primarily taking the form of single nucleotide mutations, but also including insertions and deletions (indels) or larger structural rearrangements and copy number aberrations (Vogelstein et al. 2013). Of these somatic variants, only a handful will be responsible for malignant transformation, by conferring a selective advantage to the subpopulation of cells that harbour the variant (Tomasetti et al. 2015). Such mutations are termed ‘driver mutations’; they undergo positive selection in a tumour and cause cells to acquire the hallmarks that are characteristic of malignancy (Hanahan and Weinberg 2011). Different cancer types harbour different numbers of driver mutations, averaging approximately four per tumour (Martincorena et al. 2017). The remaining variants are termed ‘passenger mutations’, and they confer little functional impact (Stratton et al. 2009). One of the challenges facing cancer genomics research is distinguishing the handful of driver mutations from the vast background of passenger mutations in a cancer genome. The focus of this review will be single nucleotide driver mutations, though we will address indels and larger structural rearrangements and copy number aberrations in some instances.

Types of driver mutations

Cancer develops when cells accumulate somatic mutations, as shown in Fig. 1. It is worth noting that germline variants can also contribute toward how the mutational landscape of a cancer develops [for examples, see (Waszak et al. 2017)], and can contribute to oncogenesis by predisposing cells toward cancer development.

Fig. 1

Somatic driver mutations and cancer development. A simplified diagram depicting the process of somatic mutation accumulation and tumour formation from normal tissue

Protein-coding driver mutations

Most cancer driver mutations identified to date lie within gene bodies, and the function of these mutations can generally be ascertained by examining their impact on the encoded protein. Oncogenes are genes that are activated by mutations, allowing cells to acquire a selective advantage (Vogelstein et al. 2013) (Fig. 2a). In contrast, tumour suppressor genes contribute to cancer development through the selective advantage gained by their inactivation, which generally arises through truncating mutations or frameshift indels (Vogelstein et al. 2013) (Fig. 2b). Not all driver mutations have such a clear function, however. For example, synonymous mutations may also be driver events in cancer if they differentially regulate gene splicing (Supek et al. 2014). Similarly, larger structural variations and copy number aberrations such as genomic deletions may lead to gene fusion events that truncate tumour suppressor genes, or create tumourigenic novel proteins (Mertens et al. 2015). These genetic alterations can subsequently lead to dysregulation of important pathways, resulting in cancer development. When first published in 2004, the Cancer Gene Census [hosted by the Catalogue of Somatic Mutations in Cancer (COSMIC) database (Forbes et al. 2015; Forbes et al. 2011)] had annotated 291 well-characterised ‘cancer genes’ (Futreal et al. 2004). This list now contains more than 500 entries. Some driver genes are commonly mutated across cancer types, including TP53, ARID1A, KRAS and PIK3CA, while other driver genes are more tumour specific (Gonzalez-Perez et al. 2013).

Fig. 2

Types of cancer driver mutations. Diagram depicting the formation of driver mutations in (a) oncogenes, (b) tumour suppressor genes and (c) non-coding regions. Bars denote exons, and red triangles depict an example pattern of somatic mutation accumulation across a cancer cohort

Non-coding driver mutations

Many germline variants associated with cancer and other diseases are situated in the non-coding genome (Maurano et al. 2012). In recent years, decreasing genome sequencing costs have enabled the identification of somatic cancer driver mutations in the ~98% of the genome that is non-coding. Far fewer non-coding than coding cancer driver mutations have so far been identified, with current examples generally impacting oncogenesis by altering cis-regulation (Fig. 2c).

Non-coding somatic driver mutations may impact transcription factor binding by removing an existing binding motif, or creating a de novo binding site and even an entirely novel regulatory element. For example, the promoter of the TERT gene is mutated in more than 50 cancer types [reviewed in (Bell et al. 2016)]. TERT promoter single nucleotide mutations create a transcription factor binding site that upregulates TERT expression, and were first described in melanoma (Horn et al. 2013; Huang et al. 2013). Other cancer driver mutations in promoter elements have since been discovered, mutating regulatory sites for cancer driver genes such as FOXA1 (Rheinbay et al. 2017). Indels are also able to alter gene cis-regulation by creating or removing transcription factor binding sites [for examples, see (Abraham et al. 2017; Mansour et al. 2014; Rahman et al. 2017)]. On a larger scale, structural variations and copy number aberrations can duplicate, remove or relocate cis-regulatory elements, leading to the dysregulation of enhancer-promoter interactions, and contributing to oncogenesis [for examples, see (Groschel et al. 2014; Zhang et al. 2016)]. In addition to these direct alterations to cis-regulatory elements, the nature of cis-regulation means that these sites are also susceptible to epigenetic dysregulation, through alterations to DNA methylation, nucleosome occupancy or the accessibility of chromatin [reviewed in (Poulos and Wong 2017); please also see this reference for a more comprehensive description of recent efforts undertaken to identify non-coding driver mutations in cancer genomes]. Non-coding driver mutations may also lie outside of cis-regulatory regions, affecting other genomic elements, such as long non-coding RNAs [for example, see (Lanzós et al. 2017)]. Further research efforts will be necessary to fully elucidate the role of non-coding mutations, which may have less clear impacts on cellular function.

Tools for annotating variants to identify driver mutations

A number of computational tools are available for the annotation of putative driver mutations. These tools typically assess a combination of measures in order to determine the likely functional impact of a given variant. Measures of function in the protein-coding genome generally focus on the impact that a somatic variant will have on protein translation, prioritising missense and nonsense mutations over synonymous variants. Measures of function in the non-coding genome generally consider conservation and transcription factor binding motifs, as well as epigenetic features. Table 1 briefly describes a selection of the tools available for the annotation of variants in either the protein-coding or non-coding genome. Many other tools are available for such variant annotation, and this list is not exhaustive. Ultimately, choosing the correct tool for a specific analysis will depend on the downstream applications required.

Table 1 Description of some of the tools available for the annotation of coding and non-coding variants identified from cancer sequencing data

Positive selection and driver identification

Defining positive selection

Negative selection is common in evolutionary history, but it is rare in cancer development, with only ~1% of protein-coding mutations undergoing negative selection in cancer (Martincorena et al. 2017). Instead, positive selection for driver mutations is much more common in oncogenesis. One method commonly used to detect genes undergoing positive selection in coding regions is analysis of the dN/dS ratio, which compares the rate of non-synonymous substitutions (dN, those that change the encoded amino acid) with the rate of synonymous substitutions (dS, those that do not) in a given gene. Because synonymous changes are assumed to be largely neutral, a ratio above one indicates an excess of non-synonymous mutations and hence positive selection. Researchers can discover cancer driver genes by examining those genes that harbour such an excess. Oncogenes and tumour suppressor genes generally harbour an excess of missense and nonsense mutations, respectively (Martincorena et al. 2017).
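As an illustration of the dN/dS idea, the following is a minimal sketch rather than the method used by any published tool. It assumes every single-nucleotide change is equally likely, which real analyses do not, and the function names (`mutation_opportunities`, `dn_ds`) are hypothetical. Observed non-synonymous and synonymous counts are each normalised by the number of possible changes of that class in the coding sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def mutation_opportunities(cds):
    """Count possible synonymous and non-synonymous single-base changes.

    Simplifying assumption: all substitutions are equally likely.
    Any trailing partial codon is ignored.
    """
    syn = nonsyn = 0
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    syn += 1
                else:
                    nonsyn += 1
    return syn, nonsyn


def dn_ds(cds, observed_nonsyn, observed_syn):
    """Return a naive dN/dS estimate, or None when it is undefined."""
    syn_opp, nonsyn_opp = mutation_opportunities(cds)
    if observed_syn == 0 or syn_opp == 0 or nonsyn_opp == 0:
        return None
    return (observed_nonsyn / nonsyn_opp) / (observed_syn / syn_opp)
```

A value well above one for a gene would suggest an excess of non-synonymous mutations. In practice, the ratio should be estimated with a substitution model that accounts for sequence context and mutation rate variation.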

Here, we briefly discuss some of the tools that are available for analyses of positive selection in cancer DNA. OncodriveFML (Mularoni et al. 2016) detects positive selection in both coding and non-coding genomic regions by assessing mutation function. e-Driver (Porta-Pardo and Godzik 2014) and OncodriveCLUST (Tamborero et al. 2013a) similarly measure positive selection, specifically examining the internal distribution of variants within a gene to detect domains harbouring an excess of mutations. ActiveDriver (Reimand and Bader 2013; Reimand et al. 2013) is a statistical method that detects positive selection by analysing phosphorylation-associated variants. MuSiC (Dees et al. 2012) relies on measures of mutation recurrence, together with clinical and coverage data in order to statistically evaluate cancer sequencing datasets for potential drivers. Researchers using multiple complementary methods for these types of analyses should detect greater numbers of high-confidence cancer driver events (Tamborero et al. 2013b).

Establishing expected background mutation loads

Mutational processes do not act equally throughout the genome, and certain regions of DNA are more likely to acquire somatic mutations in cancer. For example, lowly expressed genes and regions of heterochromatin are less commonly subjected to transcription-dependent repair mechanisms, and such sites generally accumulate higher mutation loads (Schuster-Bockler and Lehner 2012; Zheng et al. 2014). Similarly, late replicating regions accumulate more mutations, likely due to mismatch repair being less active at such sites (Supek and Lehner 2015), exhaustion of the free nucleotide pool and/or difficulty navigating heterochromatin (Stamatoyannopoulos et al. 2009). Considering mutation rates at smaller scales, exons accumulate fewer mutations than intronic regions due to increased mismatch repair activity at such loci (Frigola et al. 2017). In addition, regions of transcription factor binding, such as at promoter elements or CTCF binding sites, acquire high mutation loads in some cancers because nucleotide excision repair machinery is inhibited from repairing mutagenic DNA lesions (Perera et al. 2016; Poulos et al. 2016; Sabarinathan et al. 2016). At nucleotide resolution, highly methylated cytosines are more often mutated in some cancers, due to the increased tendency for methylated cytosines to deaminate to thymine, and due to particular features of DNA replication and repair at such loci (Poulos et al. 2017).

Driver mutations confer a growth advantage, and they consequently undergo positive selection in a cellular subpopulation. However, accurate inferences of positive selection can be hindered by some of the mutation rate variations described here. It is vital for researchers to understand which combinations of these and other mutational processes may be operative in a given cancer genome. Analyses of this kind are particularly important because researchers typically use the recurrence of a mutation to determine the likelihood of its being a cancer driver, or to select cancer-associated genes. Recurrence-based analyses that ignore mutation rate variation can therefore lead to the false-positive identification of cancer driver mutations and genes which simply lie in highly mutated regions of the genome (Lawrence et al. 2013). It should be noted, though, that even mutations accumulating due to increased mutability at certain loci may still be driver events. However, by accurately modelling the expected background mutation rates in a cohort under investigation, researchers should be better able to exclude spurious highly mutated regions, instead identifying true driver mutations and genes that will stand out from among the corrected background of passenger mutations.

One commonly used analytical method for calculating mutation rate variation is MutSigCV (Lawrence et al. 2013). This tool combines sample-specific mutation frequency with measures of gene-specific mutation rate, using gene expression and replication timing data (Lawrence et al. 2013). Similar methods have also been developed specifically for analyses of the non-coding genome, such as MutSigNC (Rheinbay et al. 2017) and LARVA (Lochovsky et al. 2015). These tools can assist researchers in the identification of genes that are mutated at low to intermediate frequencies. However, saturation analyses have demonstrated that even with such models, highly mutated cancer cohorts could require thousands of samples of a single cancer type in order to accurately identify less frequently mutated driver genes (Lawrence et al. 2014).
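The effect of background correction can be illustrated with a toy calculation. This sketch is not the MutSigCV model. It assumes mutation counts follow a Poisson distribution, and the `local_rate_factor` parameter is a hypothetical stand-in for covariates such as expression or replication timing. The question asked is how surprising a gene's mutation count is, given the expected count.

```python
import math


def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    term = math.exp(-expected)  # P(X = 0)
    cumulative = 0.0
    for i in range(observed):
        cumulative += term          # add P(X = i)
        term *= expected / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - cumulative)


def gene_enrichment(observed_mutations, gene_length_bp, n_samples,
                    background_rate, local_rate_factor=1.0):
    """Expected mutation count and enrichment p-value for one gene.

    background_rate: assumed cohort-wide mutations per base pair per sample.
    local_rate_factor: hypothetical multiplier for regional mutability.
    """
    expected = (background_rate * local_rate_factor
                * gene_length_bp * n_samples)
    return expected, poisson_upper_tail(observed_mutations, expected)


# The same eight mutations look far less remarkable once the gene is
# assumed to sit in a region that mutates four times faster than average.
_, p_uniform = gene_enrichment(8, 1500, 500, 2e-6)
_, p_adjusted = gene_enrichment(8, 1500, 500, 2e-6, local_rate_factor=4.0)
```

Under these assumptions, a gene that appears significantly mutated against a uniform background may no longer stand out once regional mutability is taken into account. This is the false-positive scenario that background models aim to prevent.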

Tumour heterogeneity and driver identification

Individual cells within a tumour will acquire mutations throughout their lifetime, and the resultant tumour mass will consist of a heterogeneous population of cells (Fig. 1). With the exception of data produced from single-cell sequencing applications, the results of cancer exome or genome sequencing will generally represent the combination of mutation profiles that were present within the subsection of tumour that was sequenced. These mutation profiles can theoretically be separated into distinct clones and subclones, revealing important insights into cancer pathogenesis, and specifically, which coding or non-coding mutations are the drivers that conferred a growth advantage. Research of this kind is particularly important when considering personalised cancer treatments, as mutations that are only present in a small subclone can become key drivers of cancer relapse (Schmitt et al. 2016). Subclones can be identified by analysing copy number-corrected variant allele frequencies for each of the somatic mutations present in a tumour. Mutations in distinct subclones will generally exhibit similar allele frequencies (Yates and Campbell 2012). Some of the tools available for the analysis of cancer clonality include ABSOLUTE (Carter et al. 2012), THetA (Oesper et al. 2013), SubcloneSeeker (Qiao et al. 2014), SciClone (Miller et al. 2014), PyClone (Roth et al. 2014) and SuperFreq (Flensburg et al. 2017). In order to study subclonal heterogeneity in a given cancer sample comprehensively, researchers may require sequencing data from multiple samples from an individual’s tumour [for example, see (Yates et al. 2015)].
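The copy number correction of variant allele frequencies can be sketched as follows. This is a simplified illustration and not the algorithm of any of the tools listed above, which generally rely on statistical mixture models. It assumes that tumour purity and local copy number are already known, and the function names, default multiplicity of one and gap-based grouping are all hypothetical choices made for brevity.

```python
def cancer_cell_fraction(vaf, purity, tumour_cn=2, normal_cn=2,
                         multiplicity=1):
    """Estimate the fraction of cancer cells carrying a mutation.

    vaf: variant allele frequency observed in the sequenced sample.
    purity: assumed fraction of cells in the sample that are cancer cells.
    tumour_cn / normal_cn: total copy number at the locus in each cell type.
    multiplicity: assumed number of mutated copies per mutated cancer cell.
    """
    average_copies = purity * tumour_cn + (1.0 - purity) * normal_cn
    return min(1.0, vaf * average_copies / (purity * multiplicity))


def group_by_ccf(ccfs, max_gap=0.1):
    """Group sorted cancer cell fractions, splitting at gaps > max_gap."""
    groups = []
    for ccf in sorted(ccfs):
        if groups and ccf - groups[-1][-1] <= max_gap:
            groups[-1].append(ccf)
        else:
            groups.append([ccf])
    return groups


# Example: a diploid region in a sample assumed to be 80% tumour cells.
vafs = [0.40, 0.39, 0.20, 0.21, 0.10]
ccfs = [cancer_cell_fraction(v, purity=0.8) for v in vafs]
clusters = group_by_ccf(ccfs)
```

Mutations with a cancer cell fraction near one are candidates for the founding clone, while groups at lower fractions suggest subclones. Real data are noisy, so a fixed gap threshold of this kind would rarely be adequate outside a toy example.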

Mutational signatures as clues in the cancer genome

One method for understanding and visualising the mutational processes operating in a cancer genome is to generate mutational signatures (Alexandrov et al. 2013a). Mutational signatures represent the frequencies of each type of mutation (C > A, C > G, C > T, T > A, T > C, T > G), together with their flanking nucleotides, and are presented as the counts of the 96 possible trinucleotide mutation combinations (six substitution types, each with four possible 5′ and four possible 3′ neighbouring bases). The COSMIC database (Forbes et al. 2015; Forbes et al. 2011) currently describes 30 distinct mutational signatures that have been identified in cancer samples, with each representing the action of a mutational process. For example, signatures have been identified that represent endogenous mutational processes such as defective DNA proofreading following Polymerase Epsilon (POLE) mutation (signature 10), deficient mismatch repair (signature 6) or the action of AID/APOBEC enzymes (signatures 2 and 13) (Alexandrov et al. 2013a). Mutational signatures have also been defined that result from exposure to exogenous mutagens such as cigarette smoke (signature 4) or ultraviolet light (signature 7) (Alexandrov et al. 2013a). A cancer genome will generally harbour mutations arising from a number of different mutational processes, each operating at differing intensities and/or over differing periods of time (Alexandrov et al. 2013b). The final mutational landscape will therefore be combinatorially affected by a number of mutational signatures (Alexandrov et al. 2013b).
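The 96-channel representation described above can be sketched in a few lines. The code below is a minimal illustration with hypothetical function names. It assumes each mutation is supplied as its reference trinucleotide plus the alternate base, and it follows the common convention of reporting every substitution relative to the pyrimidine (C or T) of the mutated base pair. It builds the count vector only and does not decompose it into signatures.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# 6 substitution types x 4 upstream bases x 4 downstream bases = 96 channels.
CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five in "ACGT"
    for three in "ACGT"
]


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def classify(trinucleotide, alt):
    """Map a mutation to its channel, e.g. ('TGA', 'A') -> 'T[C>T]A'."""
    if trinucleotide[1] in "AG":
        # Purine reference base: describe the event on the opposite strand.
        trinucleotide = reverse_complement(trinucleotide)
        alt = alt.translate(COMPLEMENT)
    five, ref, three = trinucleotide
    return f"{five}[{ref}>{alt}]{three}"


def mutation_spectrum(mutations):
    """Return the 96 channel counts, ordered as in CHANNELS."""
    counts = Counter(classify(context, alt) for context, alt in mutations)
    return [counts[channel] for channel in CHANNELS]


# Two of these three mutations fall in the same channel once the
# purine-centred event is flipped to the pyrimidine strand.
spectrum = mutation_spectrum([("TCA", "T"), ("TGA", "A"), ("ATG", "C")])
```

A vector of this form, computed per sample, is the usual starting point for comparing a tumour against known signatures.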

By understanding the mutational signatures that are present in a particular cancer, researchers may gain insights into which driver mutation(s) might also be present in that tumour. For example, the presence of signature 10 will not only implicate a likely mutation in the exonuclease domain of POLE, but the modified trinucleotide mutation frequencies that result from POLE mutation may also predispose the cancer to gaining truncating mutations in APC or TP53 (Poulos et al. 2017). In another example, by analysing the DNA of cancers with large numbers of C > T mutations (associated with signature 1, following the deamination of methylated cytosines), researchers uncovered a germline mutation in the DNA glycosylase MBD4 that may predispose cells to subsequently developing certain driver mutations that accelerate oncogenesis (Sanders et al. 2017). Research associating mutational signatures with specific variants may uncover further mutated genes that are responsible for the generation of certain mutational profiles that drive cancer development.

Databases of driver mutations and cancer sequencing data

For researchers seeking robust lists of established cancer driver genes, there are a number of databases available for analysis. Two such databases are the Cancer Gene Census and IntOGen. As previously discussed, the COSMIC database (Forbes et al. 2015; Forbes et al. 2011) hosts the Cancer Gene Census (Futreal et al. 2004), which contains a list of genes, undergoing ongoing curation, that have been well established in cancer development (http://cancer.sanger.ac.uk/census/). Similarly, IntOGen (Gonzalez-Perez et al. 2013) is a web platform that uses annotation tools to provide lists of cancer drivers identified from large cancer sequencing datasets (https://www.intogen.org/). It is worth noting that well-established non-coding driver mutations are still rare in cancer research, and curated databases therefore primarily focus on protein-coding variants. Researchers intending to examine non-coding driver mutations may need to manually examine the literature for such examples [some current examples reviewed in (Cuykendall et al. 2017) and (Poulos and Wong 2017)].

Researchers can also interrogate databases of mutations that have been curated from large-scale cancer sequencing projects. TCGA data is stored at the Genomic Data Commons (GDC), which can be accessed at https://portal.gdc.cancer.gov/ (Grossman et al. 2016). ICGC data is stored at the ICGC Data Portal, which can be accessed at https://dcc.icgc.org/ (Zhang et al. 2011). Both websites provide user-friendly interfaces, allowing searches by gene, cancer type and mutation. Similarly, the COSMIC database (http://cancer.sanger.ac.uk/) contains records of somatic mutations identified in cancer, including manually curated expert data, as well as data from large sequencing projects such as TCGA and ICGC (Forbes et al. 2015; Forbes et al. 2011). cBioPortal (http://www.cbioportal.org/) is another resource that researchers can use to interrogate cancer genomics datasets, via a web interface that allows accessible data visualisation and analysis (Gao et al. 2013).

Future directions in cancer driver discovery

Through the advent of large-scale cancer sequencing projects, many new cancer driver genes and mutations have been identified. This endeavour has been greatly enhanced by the development of new analytical and statistical methods for selecting recurrently mutated loci with an excess of functional variants. However, driver mutations in many cancers have not yet been fully established. Many driver mutations likely lie within cancer driver genes that are yet to be identified (Martincorena et al. 2017), as well as within non-coding regions that have not yet been examined in sufficient detail due to limited sample sizes and availability of epigenomic datasets (Cuykendall et al. 2017; Poulos and Wong 2017). Such mutations may be detected as cancer cohort sizes increase.

The search for driver mutations in cancer genomes is a vital step in the move toward personalised approaches to cancer treatment. By identifying the molecular changes responsible for driving cancer, drugs can be designed that specifically target mutated or dysregulated genes. Further, by defining the mechanisms underlying the formation of such driver events, new strategies may be developed that prevent damage or even enhance repair to commonly mutated regions of DNA.