Large-scale analysis of whole genome sequencing data from formalin-fixed paraffin-embedded cancer specimens demonstrates preservation of clinical utility

Basyuni, Shadi; Heskin, Laura; Degasperi, Andrea; Black, Daniella; Koh, Gene C. C.; Chmelova, Lucia; Rinaldi, Giuseppe; Bell, Steven; Grybowicz, Louise; Elgar, Greg; Memari, Yasin; Robbe, Pauline; Kingsbury, Zoya; Caldas, Carlos; Abraham, Jean; Schuh, Anna; Jones, Louise; Tischkowitz, Marc; Brown, Matthew A.; Davies, Helen R.; Nik-Zainal, Serena

doi:10.1038/s41467-024-51577-2

Large-scale analysis of whole genome sequencing data from formalin-fixed paraffin-embedded cancer specimens demonstrates preservation of clinical utility

Article
Open access
Published: 04 September 2024

Volume 15, article number 7731, (2024)
Cite this article

Download PDF

You have full access to this open access article

From

View current issue

Large-scale analysis of whole genome sequencing data from formalin-fixed paraffin-embedded cancer specimens demonstrates preservation of clinical utility

Download PDF

654 Accesses
22 Altmetric
Explore all metrics

Abstract

Whole genome sequencing (WGS) provides comprehensive, individualised cancer genomic information. However, routine tumour biopsies are formalin-fixed and paraffin-embedded (FFPE), damaging DNA, historically limiting their use in WGS. Here we analyse FFPE cancer WGS datasets from England’s 100,000 Genomes Project, comparing 578 FFPE samples with 11,014 fresh frozen (FF) samples across multiple tumour types. We use an approach that characterises rather than discards artefacts. We identify three artefactual signatures, including one known (SBS57) and two previously uncharacterised (SBS FFPE, ID FFPE), and develop an “FFPEImpact” score that quantifies sample artefacts. Despite inferior sequencing quality, FFPE-derived data identifies clinically-actionable variants, mutational signatures and permits algorithmic stratification. Matched FF/FFPE validation cohorts shows good concordance while acknowledging SBS, ID and copy-number artefacts. While FF-derived WGS data remains the gold standard, FFPE-samples can be used for WGS if required, using analytical advancements developed here, potentially democratising whole cancer genomics to many.

Cross-laboratory validation of the OncoScan® FFPE Assay, a multiplex tool for whole genome tumour profiling

Article Open access 18 February 2015

Development and validation of a whole-exome sequencing test for simultaneous detection of point mutations, indels and copy-number alterations for precision cancer care

Article Open access 20 July 2016

High-Resolution Copy Number Patterns From Clinically Relevant FFPE Material

Article Open access 20 June 2019

Introduction

Genome sequencing technologies have advanced our understanding of cancer, providing more personalised management options¹. Until recently, targeted sequencing strategies, limited to several hundred known cancer-associated genes, have been the mainstay of oncology^2,3,4. However, whole genome sequencing (WGS) carries several thousand-fold more data per patient, providing a holistic picture of the cancer genomic state, revealing a multitude of highly individualised, clinically valuable genomic characteristics missed by targeted approaches^5,6.

Recent advances in sequencing methodologies have significantly reduced the base cost of sequencing, contributing to the potential for more widespread adoption of WGS in research and clinical settings^7,8,9,10. Yet, a critical barrier preventing utility in the clinical setting is that tumour samples are routinely formalin-fixed and paraffin-embedded (FFPE) for histopathological analysis. The FFPE process causes considerable DNA damage, precluding WGS, which hitherto has been preferentially conducted on fresh frozen (FF) specimens where DNA integrity is preserved. Additionally, the logistical demands associated with snap freezing, including requiring liquid nitrogen equipment and specialist sample handling, prohibit tumour collections through routine diagnostic pathways, particularly in non-tertiary hospital settings and lower-resourced healthcare systems. Targeted genomic approaches are possible on FFPE specimens. Thus they have become the preferred cancer genomic option over the last decade.

Furthermore, previous evaluations of WGS data from FFPE specimens have revealed that most FFPE artefacts are detected at frequencies exceeding background noise (~1% of sequencing reads) but <10%^11,12,13,14. This led to bioinformatic solutions, including filtering variants with allelic fractions of 10% or less. However, this approach leads to the exclusion of genuine mutations, including clinically actionable variants present at low variant allelic fractions (VAFs). Consequently, the value of FFPE WGS has been called into question¹⁵. Moreover, reports of recurrent false positives have further hindered clinical adoption—for example, a potentially actionable mutation, EGFR T790M, has been ascribed as an FFPE-mediated artefact¹⁶. Mutational signatures have been proposed as a means to identify and quantify artefact¹⁷; however, they have yet to be comprehensively examined in a substantially powered dataset or validated in clinical cohorts. WGS of FFPE-derived material was thus considered off-limits for many years, with FF or RNAlater-stored¹⁸ samples being the gold standard for WGS.

Here, we set out to explore whether it was possible to salvage WGS data by analysing a large cohort of FFPE-derived specimens. This is driven by the need to extend comprehensive genomics beyond the confines of specialised, well-resourced centres. In addition, there are clinical scenarios where it is simply not possible to obtain FF material, or where the decision for WGS is an ex post facto justification (i.e., one that is made retroactively following, for example, histopathological diagnosis or failure of conventional treatment) and further specimen collections are not possible. Lastly, enabling comprehensive, genome-wide analytics on FFPE-derived tumour material would considerably facilitate exploratory analyses on less well-studied populations, potentially democratising cancer genomics to more diverse cohorts worldwide¹⁹.

Results

We examined a cohort comprising 578 WGS FFPE-stored treatment naïve primary cancer samples from nine tumour-types and matched germline samples derived from peripheral blood mononuclear cells. These were compared to 11,014 WGS FF samples in the discovery phase. All patients were diagnosed with cancer in England’s National Health Service and underwent a primary surgical procedure with curative intent without prior adjuvant therapy. The following tumour-types were available for the study: breast, central nervous system (CNS), colorectal, kidney, lung, ovary, prostate, uterus, and bladder (Supplementary Table 1). A fraction of FF tumour samples underwent polymerase chain reaction (PCR) during library preparation (n = 899, 8% of FF samples, herewith referred to as FF(PCR)). All samples were taken through the standard Genomics England (GEL) WGS bioinformatic pipeline, including alignment and somatic-variant calling using default parameters (Supplementary Methods Section S1).

Quality of sequencing data

Quality of sequencing data was assessed using customary coverage and alignment metrics (Fig. 1A; Supplementary Table 5 and Supplementary Data 1). Compared to FF-derived samples, FFPE-derived samples yielded data of poorer quality, with smaller insert sizes (391 base pairs vs. 477 base pairs; p < 0.0001) and a higher percentage of chimeric DNA fragments (0.51% vs. 0.26%; p < 0.0001), indicative of damaged DNA templates. Mapping of FFPE-derived short reads to the reference genome was marginally reduced (93.4% vs. 94.1%; p < 0.0001). There was increased heterogeneity in genome coverage with a bias towards GC over-representation and consequential depletion of AT sequences. These differences were not due to factors such as cancer cell content or tumour type (Supplementary Fig. 1). FF(PCR) metrics interestingly resembled PCR-free FF libraries, implying that the primary cause for poorer sequencing quality in FFPE-derived DNA libraries is the quality of the starting material, and not due to the library-preparation process.

Distributions of VAFs for substitutions and indels were normalised to cancer cell content (nVAF) (Fig. 1B). FF substitution and indel variants demonstrated a peak at nVAF of 0.5, in keeping with the expected heterozygosity of most variants. By contrast, variants from FF(PCR) and FFPE samples showed the tallest peaks at nVAF of 0.1, with a modest peak at nVAF of 0.5 in keeping with PCR-related and/or DNA-damage-induced artefacts. Substitution and structural variation burdens were not systematically increased in FFPE libraries across all tissue types, unlike indel burdens which were increased by order of magnitude irrespective of tumour-type (Fig. 1C). The substantial excess of indels was similarly observed in FF(PCR) libraries, suggesting that the source of indel artefacts could be PCR-related during library preparation.

Drivers and actionable variants

Domain 1 somatic variants are defined by Genomics England as variants in a virtual panel of potentially actionable genes (168 genes listed in Actionable genes in solid tumour v2; Supplementary Methods Section S1.5). Detection of Domain 1 variants was uncompromised in FFPE-derived samples. There were no significant differences in the prevalence of Domain 1 events (e.g., breast (p = 0.59), colorectal (p = 0.37), lung (p = 0.96), uterus (p = 0.17)) between FFPE, FF, and FF(PCR) cohorts (Fig. 2A; Supplementary Fig. 3). As actionable mutations are of immediate clinical value, specific variants were explored. For example, variants associated with gefitinib response in lung cancers were not differentially represented in FFPE-derived WGS samples. EGFR variants indicating gefitinib sensitivity (L858R, G719S, or exon 19 deletions) were present in 8.1% (104) of FF lung samples, 8.6% (9) of FF (PCR) samples and 14.1% (9) of FFPE samples. The marginally higher proportion of these EGFR variants in FFPE samples might be attributed to variations in histological subtypes of lung cancer. However, this observation should be interpreted with caution due to the limited sample size. Similarly, the KRAS G12C mutation associated with gefitinib resistance was identified in 10.5% (135) of FF samples, 11.5% (12) of FF (PCR) samples, and 7.8% (5) of FFPE samples. Assessing allelic frequency and relative cancer cell content for each of these variants showed no inter-group biases (Fig. 2B, Supplementary Table 6), allaying concerns regarding false positives, as unlike true mutations, artefact would not be expected to correlate with cancer cell content. These findings were consistently observed for actionable variants in other tissue types (e.g., PIK3CA in breast cancer, BRAF V600E mutation in all organs) (Fig. 2B). Critically, the analysis highlights how previous attempts at bioinformatic filtering of VAF < 10% in FFPE samples would result in the exclusion of true clinical actionable variants. Indeed, 125 (7.7%) of PIK3CA and BRAF V600E mutations occurred at a VAF < 10% and would have been discarded using VAF filtering. The EGFR T790M variant previously purported as an FFPE-mediated artefact was not identified in any of the 578 samples examined, adding further reassurance. Copy number driver events were compared between FF and FFPE groups. Clinically relevant amplification drivers such as ERBB2 in breast cancers (FF: 7.7%, FFPE:4.7%), GNAS (20q13) in colorectal cancers (FF: 6.4%, FFPE: 4.5%), and CCND1 (FF: 6.9%, FFPE: 8.5%) were comparably retrieved. The calling of driver homozygous deletions was compromised in ~45% of FFPE samples and 34% of FF samples due to an excess of copy number segmentation (or ‘hypersegmentation’), a known issue with copy number calling in general, and was not specific to FFPE samples. In the remaining samples, homozygous deletions in three clinically important tumour suppressor genes, CDKN2A, MAP2K4, and PTEN occurred at analogous percentages in FF and FFPE samples (Supplementary Tables 7 and 8). No actionable gene fusions were identified in this FFPE cohort.

**Fig. 2: Comparison of putative driver events in Genomics England Cohort.**

Mutational signature analysis defines FFPE-related signatures

Given the extensive size of the FFPE cohort, we were able to detect and comprehensively characterise artefactual mutational signatures. Both known and previously uncharacterised artefactual patterns were discerned (Supplementary Data 2). Two substitution signatures associated with FFPE were identified (Fig. 3A; Supplementary Data 2). The first, SBS57, was previously postulated to be an artefactual signature²⁰, although not overtly associated with FFPE. The other, a hitherto undescribed pattern (SBS FFPE), is characterised by T > C mutations (in particular at ATA, ATC, CTT, and TTT (mutated base underlined)) as well as C > T mutations (in particular at ACA, ACT, and TCT). 64.2% of FFPE samples carried either SBS57 and/or SBS FFPE (371/578). More than half of the FFPE samples contained SBS57 (55.36%, 320). The proportion of SBS57 per sample tended to decrease as the total substitution count increased. By contrast, SBS FFPE was identified in fewer samples (17.12%, 99), appeared to be idiosyncratic, and tended to dominate the mutational landscape of affected samples, presenting with a ‘hypermutator’ phenotype (Fig. 3B). We were unable to find associations between SBS FFPE and any other feature, such as extraction protocol or tumour-type. SBS57 was also detected in a small fraction of FF samples that underwent PCR during library preparation (Supplementary Table 9 and Supplementary Data 5). The absence of SBS FFPE in these FF samples with PCR suggests that it is a manifestation of FFPE fixation. We additionally found a new pathognomonic indel signature characterised by 1 base pair A/T insertions and deletions at long polyA/T and dinucleotide deletions at long repeats, accompanied by long insertions at non-repeats (Fig. 3C; Supplementary Data 2). This signature, herewith referred to as ID FFPE, was present in 99.7% (576) of FFPE samples and also followed the pattern of decreasing proportional contribution as total indel count increased (Fig. 3D). Notably, all common biological signatures typically found in FF samples were identified in FFPE samples even when artefactual signatures were present (Supplementary Datas 3, 4). There was a systematic misassignment of SBS3 and SBS8, Homologous Recombination (HR)-deficiency-associated substitution signatures, which do not have distinctive peaks and are relatively flat and featureless. These were systematically over-called in samples that do not necessarily have biological HR-deficiency (Fig. 4).

**Fig. 3: Mutational signatures associated with FFPE artefact.**

**Fig. 4: Overview of Genomics England FFPE cohort.**

Given the observations above, we created a score to quantify FFPE-related artefact per sample, called FFPEimpact—a quantitative index based on the proportion of FFPE-related substitution and indel artefacts (Supplementary Methods Section S1.8). FFPEimpact scores ranged from 0 to 0.74 with a median of 0.27 (IQR = 0.14) in the cohort of 578 cases. There did not appear to be systematic loss of genomic information as FFPEimpact scores increased and there was no correlation to sequencing metrics. This was consistent across all tissue-types (Fig. 4). However, FFPEimpact appeared to be enriched in the Covaris-based DNA extraction protocol when compared to a Qiagen system (p < 0.0001) (Supplementary Fig. 5). At lower levels, FFPEimpact scores are predominantly influenced by ID FFPE, with its contribution increasing as the FFPEimpact score increases. However, at higher scores, there is a discernible contribution from both indel and SBS artefacts (Supplementary Fig. 4). Crucially, FFPEimpact can be provided as a metric within a typical WGS report to enhance clinical awareness that a particular FFPE WGS sample may contain substantial artefacts, informing clinical interpretation.

The identification of artefactual signatures within any sample enables the subtraction of these signatures prior to the application of clinically relevant algorithms. For example, HRDetect²¹, a probabilistic score of Homologous Recombination deficiency, can be critically salvaged providing biologically valuable information (Fig. 4; Supplementary Methods Section S1.9; Supplementary Data 6). Note that 18 samples were identified as HRDetect ‘high’ or HR-deficient, in contrast to 331 samples that were misassigned with SBS3 and SBS8.

The results thus far indicate that it is possible to analyse WGS of FFPE samples using customary ‘primary’ (alignment) and ‘secondary’ (somatic mutation-calling) WGS bioinformatic pipelines without filtering of variants based on VAF or any other parameter, minimising the risk of discarding clinically relevant variants. Instead, existing WGS bioinformatic pipelines can be used with the simple characterisation of FFPE signatures and the addition of the FFPEimpact score. Subsequent subtraction of FFPE-related signatures during downstream ‘tertiary’ analysis can be seamlessly incorporated if desired, and algorithmic stratification based on mutational signatures such as HRDetect can be utilised effectively.

Analysis of validation cohorts

To validate this potential informatic process, 51 matched pairs of FF and FFPE samples were sourced from the Oxford Molecular Diagnostics Centre¹⁴. This cohort included patients with breast, colorectal, kidney, lung, prostate, or uterine cancers (Supplementary Table 2). We applied the same analytic workflow of a standard WGS bioinformatic pipeline followed by FFPEimpact and FFPE-related artefactual subtraction in tertiary analysis. Using our method above, there was no significant difference in the total number of calls in driver genes (Fig. 5A; Supplementary Table 11 and Supplementary Data 10). There was strong concordance in actionable variants between the two preparations (r = 0.84, p < 0.0001; Fig. 5B; Supplementary Table 12), reiterating that FFPE-derived WGS data can confer clinical value. Similarly strong correlations were seen in substitution signatures (r = 0.98, p < 0.0001; Fig. 5C) and rearrangement signatures (r = 0.86, p < 0.0001; Fig. 5C). In keeping with the observed larger indel burden in FFPE samples (Supplementary Table 10; Supplementary Figs. 6, 7), the concordance of indel signatures was only moderate (r = 0.59, p = 0.0081; Fig. 5C). Copy number solutions for ploidy/aberrant cell fractions were not automatically achieved in ~7.8% (4/51) of FF samples and 19.6% (10/51) of FFPE samples due to low tumour cellularity (Supplementary Data 7). Satisfactory copy number solutions for these samples were subsequently achieved with manual reseeding. 57% (29/51) of the FF/FFPE pairs showed strong concordance across ploidy, aberrant cell fractions and overall copy number profiles. Interestingly, 10% of pairs showed heterogeneity in copy number profiles, believed to be genuine biological differences in copy number aberrations. Indeed, 13.7% (7/51) FF samples performed slightly worse than FFPE samples, correlating to differing tumour cellularity between FF and FFPE samples. These findings are in-keeping with previous reports of heterogeneity in results between samples taken from different sites within the same tumour^22,23,24. Importantly, quantification of artefact using the FFPEimpact score was again possible (Supplementary Data 8) underscoring the dominant role of indel artefact, yet highlighting the increasing significance of SBS artefact with higher scores.

**Fig. 5: Comparison of two validation cohorts with matched FFPE and FF specimens.**

A second validation set was sought from a real-world prospective neoadjuvant clinical trial of early-stage triple-negative breast cancer patients (PARTNER), which provided 14 samples with FFPE WGS data and matched FF WGS data (co-consented to the Personalised Breast Cancer Programme (PBCP) in Cambridge (herewith, PARTNER/PBCP) (Supplementary Table 3). Sample processing and analysis were akin to what would occur in routine clinical diagnostic pathways. The majority of identified genomic characteristics were concurrently identified in both specimen preparations. Note that while some variants were seen in FF samples and not in FFPE, the reverse is also true (Fig. 5D; Supplementary Table 13 and Supplementary Data 10). Using customary mutational signature assignment methods, there was good concordance of positively and negatively identified substitution (~91%), indel (~88%), and rearrangement signatures (~89%). Where there were discordances in signature assignments, the absolute assignment was of a very small number of mutations (reflected in the pale colouring in Fig. 5D). Indeed, for indel signatures, there was a higher likelihood of systematic false positive assignment of ID6 (related to TOP2A mutations) and ID8 (related to homologous-recombination deficiency) in FF samples. However, it is preferable to use multiple genomic features rather than individual signatures to classify cancers, as exemplified by the HRDetect algorithm. We found that all of the samples that had ID8 reported in FF samples but not in FFPE samples did not have a high HRDetect score, reinforcing that the ID signature assignments that were unique to FF were likely false positives. Critically, quantification of artefact per sample followed by subtraction of artefact as described above helped to improve WGS analysis, demonstrating 7/8 concordance of the HRDetect homologous-recombination score as an example (Supplementary Data 9). Interestingly, comparing HRDetect with other HRD callers like CHORD²⁵ underscores the value of using FFPEimpact to remove ID artefact signatures (Supplementary Data 8). Unlike HRDetect, CHORD was misled by FFPE-induced indels, resulting in no HR deficiency calls (Supplementary Data 9).

Discussion

Although FF-derived material undoubtedly offers WGS data of the highest quality, the logistical demands of FF collections remain a significant operational barrier for many, exacerbating disparities in health-care access. Recent commissions highlighting such inequalities have advocated for breakthroughs that can reduce the access gap and democratise diagnostics^26,27,28.

Whilst FF tissue may be the gold standard for WGS, FFPE remains the preferred method of tissue preservation as histology and immunohistochemistry are central to diagnostics. Moreover, tissue yields from certain clinical diagnostic procedures, such as core biopsies, may be too small to obtain both FF and FFPE samples. Excisional biopsies of incorrectly presumed benign lesions and/or failure of conventional treatment could lead to ex post facto justifications for WGS, yet further acquisition of FF samples may not be possible. Thus, a sweeping exclusion of FFPE specimens from WGS pathways would systematically exclude patients from access to WGS technology that could confer clinical benefit. Other tissue storage solutions, such as RNAlater¹⁸ may address some of the operational barriers seen with FF tissue and produce higher-quality WGS data than FFPE. However, as long as mainstream tissue diagnostics is reliant on FFPE specimens, there will be a cohort of patients for which only FFPE tissue is available.

In medicine, we regularly encounter situations (or specimens) that are imperfect, and we must do the best we can with the information at hand. Here, we show that although WGS from FFPE samples may sometimes demonstrate poorer quality sequencing metrics, the data obtained are not void of clinical value. Rather than adopting the conventional practice of simply filtering mutations based on their VAF, we sought to derive a means of quantifying and characterising the artefacts, that can then be incorporated into downstream processing to maximise information recovered from FFPE samples^16,17. We comprehensively characterised artefacts associated with FFPE samples and demonstrated that it is possible to preserve actionable information from WGS of cancers using FFPE specimens, including actionable drivers, mutational signatures, and machine-learning-based algorithms. We developed and validated a new feature—the FFPEimpact score—serving as caveat emptor for the interpreting clinician. It is anticipated that with future optimisation of library kits for FFPE WGS, new artefact signatures may be discovered, which can be incorporated into the FFPEimpact score through modifications. Our matched FF and FFPE sample analyses demonstrate that in spite of FFPE-related substitution and indel artefacts and/or deficiencies in copy number signals, there is sufficient concordance in actionable information to support how clinically beneficial WGS data can be obtained from FFPE specimens, if all other options for obtaining tumour material were unavailable (an overview comparison between FF and FFPE WGS is provided in Supplementary Table 14).

The feasibility of WGS in FFPE solid tumours shown here is predicated on having matched normal samples derived from blood, not from surrounding FFPE tissue. The latter incurs a high failure rate. We appreciate that sample collection bias may contribute to the observed high success rate, however, in the validation clinical trial, patients were recruited based entirely on clinical inclusion criteria without sample preselection. Here, we still found a high success rate and a strong correlation between FFPE and FF WGS samples. It is also important to note that reliable detection of structural variants requires sufficient insert sizes, which may be compromised in older FFPE samples. In this work, we do not address the impact of FFPE on other ‘omic’ modalities, such as the transcriptome, methylome or proteome, which are increasingly being used in standard-of-care cancer profiling for selected cancer types.

Nevertheless, the findings presented here demonstrate that despite artefacts, WGS can be performed on FFPE specimens if necessary. The analytical advancements presented here can be applied to existing WGS cancer pipelines to characterise the presence of FFPE-associated artefacts and to provide a measure of the amount of artefact using the FFPEImpact score within a WGS report. Crucially, permitting FFPE WGS will not require changes to existing clinical pathways, nor should it interfere with existing FF or RNAlater sample collection. It does, however, widen participation and improve access, and should be in the arsenal of diagnostic tools available to clinicians worldwide.

Methods

Study design and participants

Evaluation of FFPE WGS was performed on 578 samples which were compared to 11,014 FF samples in the discovery phase. All patients were diagnosed with cancer in England’s National Health Service and underwent a primary surgical procedure with curative intent without prior adjuvant therapy. All provided written informed consent for WGS of tumour and a matched germline sample (peripheral blood mononuclear cells) via the 100,000 Genomes Project and were recruited across all thirteen Genomic Medicine Centres in England. The following tumour-types were included in the study: breast, central nervous system (CNS), colorectal, kidney, lung, ovary, prostate, uterus, and bladder (Supplementary Table 1; Supplementary Fig. 2). A fraction of FF tumour samples underwent polymerase chain reaction (PCR) during library preparation (n = 899, 8% of FF samples, herewith referred to as FF(PCR)).

In the validation phase, 51 matched pairs of FF and FFPE samples were sourced from the Oxford Molecular Diagnostics Centre¹⁴. This cohort included patients with breast, colorectal, kidney, lung, prostate, or uterine cancers (Supplementary Table 2). Another validation set was sought from a prospective neoadjuvant clinical trial of early-stage triple-negative breast cancer patients (PARTNER)²⁹, which provided 14 samples with FFPE WGS data. These same 14 patients had co-consented to the Personalised Breast Cancer Programme (PBCP), which provided FF WGS data. Both these studies were conducted in Cambridge (Supplementary Table 3).

Procedures

Details regarding nucleic acid extraction, DNA library preparation and genome sequencing are provided in the appendix (Supplementary Methods Section S1). Of particular note, the FFPE samples underwent a two-step DNA repair process (Supplementary Fig. 8): DNA repair step 1 contained Uracil DNA gylocyslase and Endonuclease IV to remove deaminated cytosine residues and hydrolyse the backbone at the abasic site. DNA repair step 2 contained RecJ to remove single-stranded DNA. All sequencing was performed at an accredited laboratory on a HiSeqX for the Genomics England cohort, HiSeq2500 for the Oxford cohort and a Novaseq 6000 for the PARTNER/PBCP cohort (Illumina, San Diego, CA, USA) to an average coverage of >90× for tumour samples and >30× for matched normal samples (Supplementary Table 4).

Raw short reads were aligned to reference genome build 38. Sequencing quality metrics were analysed and compared between different sample preparations. Insert sizes represent lengths of sequenced DNA fragments (Supplementary Datas 1, 7). Chimeric DNA percentage is a metric that indicates the proportion of reads synthesised from more than one template (caused by random inter-chromosomal DNA cross-linking due to DNA strand breakage). Mapping rate is defined as the percentage of reads that can be mapped to the reference genome. Coverage heterogeneity reports the read depth uniformity across the genome, unevenness is calculated as median for the root square deviation of coverage calculated in non-overlapping 100 kb windows; this metric would be 0 for a genome with absolutely uniform coverage. Adenosine/thymine (AT) and guanine/cytosine (GC) bias indicates the percentage of reads that are under or overrepresented in AT-rich or GC-rich genomic regions. Variant allele fraction (VAF) is the proportion of sequencing reads reporting a specific variant, and cancer cell content is the estimated percentage of tumour cells in the sample.

Somatic variants were identified in the 100,000 Genomes Project cohort according to the pipeline detailed by Genomics England, version 8 (Supplementary Methods Section S1.5), without additional VAF-based filtering. For validation cohorts, FF and FFPE pairs were contrasted using the same bioinformatic pipeline: somatic single nucleotide variant (SNV) detection was performed with CaVEMan (Cancer Variants through Expectation Maximisation: http://cancerit.github.io/CaVEMan/) and indel detection used Pindel (http://cancerit.github.io/cgpPindel/)³⁰, again without additional VAF-based filtering. This permits artefacts to be detected as “variants”, but through the size of the cohort, we were able to characterise these artefacts more effectively.

Somatic variants are defined by Genomics England as variants in a virtual panel of potentially actionable genes (168 genes listed in Actionable genes in solid tumour v2; Supplementary Methods Section S1.5). They are akin to ‘driver’ events that are causally implicated in tumourigenesis, although it cannot be excluded that a subset of these mutations may be passenger events. The fraction of samples reporting Domain 1 SNVs was calculated per organ for each DNA preservation method and compared. The relative abundance of SNVs in Domain 1 genes was assessed for each organ type. The top fifteen Domain 1 genes with the most variants in the FF samples were identified for each organ and the proportions of samples reporting these variants were compared between the different specimen preparation methods. The correlation between VAF and cancer cell content was assessed for specific variants to understand whether there were systematic differences between cohorts and potentially false-positive calls in FFPE samples. Somatic copy number drivers of two classes, amplification of an oncogene and homozygous deletion of a tumour suppressor gene, were sought in a set of eight clinically important genes (CDKN2A, MAP2K4, PTEN, ERBB2, FGFR1, GNAS, SOX2, CCND1 - full methodology detailed in the Supplementary Methods Section S1.10.).

Mutations were considered actionable if identified as ESMO Scale of Clinical Actionability for molecular Targets (ESCAT) tier I or tier II in the OncoKB (https://www.oncokb.org, v.3.12) or Clinical Interpretation of Variants in Cancer (CiVIC, https://civicdb.org/, v.2.0.0) databases. Detailed definitions of tier I and tier II variants are available elsewhere³¹. Briefly, ESCAT tier I targets are ready for implementation in routine clinical decisions; alteration-drug match is associated with improved outcomes in clinical trials. ESCAT tier II are investigational targets; the alteration-drug match is associated with antitumour activity, but the magnitude of the benefit is unknown.

Mutational catalogues were generated for all FFPE samples as described previously²⁰. Briefly, single base substitution (SBS) variants were classified according to their trinucleotide context, forming a 96-channel catalogue for each sample. Similarly, indel (ID) catalogues were generated for each FFPE (PCR) sample using 83 channels according to indel type and length. Tissue-specific SBS signatures or existing ID signatures were assigned for all FFPE (PCR) samples as previously reported³². In short, tissue-specific common SBS signatures were identified for each sample and subtracted from the mutational catalogue. For ID catalogues, existing ID signatures were identified for each sample and subtracted. Residual mutations were then clustered to identify recurrent patterns for signature extraction related to FFPE artefact. For each organ, the set of putative SBS signatures was annotated using the reference signatures reported at https://signal.mutationalsignatures.com/. The identified artefactual signatures were combined to calculate their contribution towards the total mutational catalogue per sample.

Statistical analysis

Statistical differences between groups were calculated in R (version 4.0.3) using Wilcoxon rank-sum and Kruskal–Wallis statistical tests. Correlation was assessed using Spearman’s Rank Correlation. All numerical data are reported as median (interquartile range).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Primary data from the 100,000 Genomes Project, which are held in a secure research environment, are available to registered members of the Research Network. Membership of the Research Network is open to all individuals, students, or staff affiliated with UK academic research institutions, NHS trusts, relevant charitable organisations, foreign universities and research institutions, governmental departments, and foreign healthcare organisations involved in significant research activity. Review of applications can take up to 10 working days. Following approval, confirmation of affiliation, and passing of information governance training, access to the Research Environment can take up to 2 working days. See https://www.genomicsengland.co.uk for further information or contact M.A.B., Chief Scientific Officer at Genomics England (matt.brown@genomicsengland.co.uk). Source data are provided with this paper.

Code availability

Analysis was performed in R (version 4.0.3). Mutational signature analysis was performed using the signature.tools.lib package (https://github.com/Nik-Zainal-Group/signature.tools.lib.git). FFPEimpact code for artefact has been made available online (https://github.com/Nik-Zainal-Group/FFPE_impact, https://doi.org/10.5281/zenodo.12725299).

References

Berger, M. F. & Mardis, E. R. The emerging clinical relevance of genomics in cancer medicine. Nat. Rev. Clin. Oncol. 15, 353–365 (2018).
Article CAS PubMed PubMed Central Google Scholar
Cottrell, C. E. et al. Validation of a next-generation sequencing assay for clinical molecular oncology. J. Mol. Diagn. 16, 89–105 (2014).
Article CAS PubMed PubMed Central Google Scholar
Pritchard, C. C. et al. Validation and implementation of targeted capture and sequencing for the detection of actionable mutation, copy number variation, and gene rearrangement in clinical cancer specimens. J. Mol. Diagn. 16, 56–67 (2014).
Article CAS PubMed PubMed Central Google Scholar
Wagle, N. et al. High-throughput detection of actionable genomic alterations in clinical tumor samples by targeted, massively parallel sequencing. Cancer Discov. 2, 82–93 (2012).
Article CAS PubMed Google Scholar
Garraway, L. A. & Lander, E. S. Lessons from the cancer genome. Cell 153, 17–37 (2013).
Article CAS PubMed Google Scholar
Roychowdhury, S. et al. Personalized oncology through integrative high-throughput sequencing: a Pilot Study. Sci. Transl. Med. 3, 111ra121 (2011).
Article PubMed PubMed Central Google Scholar
Nakagawa, H., Wardell, C. P., Furuta, M., Taniguchi, H. & Fujimoto, A. Cancer whole-genome sequencing: present and future. Oncogene 34, 5943–5950 (2015).
Article CAS PubMed Google Scholar
Ven et al. Whole genome sequencing in oncology: using scenario drafting to explore future developments. BMC Cancer 21, 488 (2021).
Article PubMed PubMed Central Google Scholar
Rosenquist, R. et al. Clinical utility of whole-genome sequencing in precision oncology. Semin. Cancer Biol. 84, 32–39 (2022).
Article CAS PubMed Google Scholar
Cuppen, E. et al. Implementation of whole-genome and transcriptome sequencing into clinical cancer care. JCO Precis. Oncol. 6, e2200245 (2022).
Article ADS PubMed PubMed Central Google Scholar
Xuan, J., Yu, Y., Qing, T., Guo, L. & Shi, L. Next-generation sequencing in the clinic: promises and challenges. Cancer Lett. 340, 284–295 (2013).
Article CAS PubMed Google Scholar
Wong, S. Q. et al. Targeted-capture massively-parallel sequencing enables robust detection of clinically informative mutations from formalin-fixed tumours. Sci. Rep.-uk 3, 3494 (2013).
Article ADS Google Scholar
Do, H., Wong, S. Q., Li, J. & Dobrovic, A. Reducing sequence artifacts in amplicon-based massively parallel sequencing of formalin-fixed paraffin-embedded DNA by enzymatic depletion of uracil-containing templates. Clin. Chem. 59, 1376–1383 (2013).
Article CAS PubMed Google Scholar
Robbe, P. et al. Clinical whole-genome sequencing from routine formalin-fixed, paraffin-embedded specimens: pilot study for the 100,000 Genomes Project. Genet. Med. 20, 1196–1205 (2018).
Article CAS PubMed PubMed Central Google Scholar
Strom, S. P. Current practices and guidelines for clinical next-generation sequencing oncology testing. Cancer Biol. Med. 13, 3–11 (2016).
Article PubMed PubMed Central Google Scholar
Ye, X. et al. High T790M detection rate in TKI-naive NSCLC with EGFR sensitive mutation: truth or artifact? J. Thorac. Oncol. 8, 1118–1120 (2013).
Article PubMed Google Scholar
Guo, Q. et al. The mutational signatures of formalin fixation on the human genome. Nat. Commun. 13, 4487 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Staaf, J. et al. Whole-genome sequencing of triple-negative breast cancers in a population-based clinical study. Nat. Med. 25, 1526–1533 (2019).
Article CAS PubMed PubMed Central Google Scholar
Mathieson, W. & Thomas, G. A. Why formalin-fixed, paraffin-embedded biospecimens must be used in genomic medicine: an evidence-based review and conclusion. J. Histochem. Cytochem. 68, 543–552 (2020).
Article CAS PubMed PubMed Central Google Scholar
Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Davies, H. et al. HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures. Nat. Med. 23, 517–525 (2017).
Article CAS PubMed PubMed Central Google Scholar
Yates, L. R. et al. Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nat. Med. 21, 751–759 (2015).
Article CAS PubMed PubMed Central Google Scholar
Hao, J.-J. et al. Spatial intratumoral heterogeneity and temporal clonal evolution in esophageal squamous cell carcinoma. Nat. Genet. 48, 1500–1507 (2016).
Article CAS PubMed PubMed Central Google Scholar
Gerlinger, M. et al. Genomic architecture and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nat. Genet. 46, 225–233 (2014).
Article CAS PubMed PubMed Central Google Scholar
Nguyen, L., Martens, J. W. M., Hoeck, A. V. & Cuppen, E. Pan-cancer landscape of homologous recombination deficiency. Nat. Commun. 11, 5584 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Fleming, K. A. et al. The Lancet Commission on diagnostics: transforming access to diagnostics. Lancet 398, 1997–2050 (2021).
Article PubMed PubMed Central Google Scholar
Lawler, M. et al. European Groundshot—addressing Europe’s cancer research challenges: a Lancet Oncology Commission. Lancet Oncol. 24, e11–e56 (2023).
Article PubMed Google Scholar
Knaul, F. M. et al. The Lancet Commission on cancer and health systems: harnessing synergies to achieve solutions. Lancet 398, 1114–1116 (2021).
Article PubMed Google Scholar
Abraham, J. E. et al. The PARTNER trial of neoadjuvant olaparib with chemotherapy in triple-negative breast cancer. Nature 629, 1142–1148 (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Mateo, J. et al. A framework to rank genomic alterations as targets for cancer precision medicine: the ESMO Scale for Clinical Actionability of molecular Targets (ESCAT). Ann. Oncol. 29, 1895–1902 (2018).
Article CAS PubMed PubMed Central Google Scholar
Degasperi, A. et al. Substitution mutational signatures in whole-genome–sequenced cancers in the UK population. Science 376, abl9283 (2022).

Download references

Acknowledgements

This work was funded by Cancer Research UK (CRUK) Advanced Clinician Scientist Award (C60100/A23916, S.N.-Z.), Dr. Josef Steiner Cancer Research Award 2019 (S.N.-Z.), Basser Gray Prime Award 2020 (S.N.-Z.), CRUK Pioneer Award (C60100/A23433, S.N.-Z.), CRUK Grand Challenge Award (C60100/A25274, S.N-Z.), CRUK Early Detection Project Award (C60100/A27815, S.N.-Z.), and the National Institute of Health Research (NIHR) Research Professorship (NIHR301627, S.N.-Z.). This work was also supported by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014 and NIHR203312). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. The Personalised Breast Cancer Programme (PBCP) received funding from the following bodies: Addenbrookes Charitable Trust Grant [9800]; The Mark Foundation for Cancer Research and Cancer Research UK [C9685/A25177]; Cancer Research UK [A27657]. The PARTNER trial received funding from the following: Cancer Research UK [CRUKE/14/048]; AstraZeneca [1994-A093777]. This work was enabled by the access to data and findings generated from the 100,000 Genomes Project under the auspices of the Pan-Cancer GeCIP (project RR239). The 100,000 Genomes Project is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care), funded by the National Institute for Health Research and NHS England. The Wellcome Trust, Cancer Research UK and the Medical Research Council have also funded research infrastructure. The 100,000 Genomes Project uses data provided by patients and collected by the National Health Service as part of their care and support. The PBCP and PARTNER trials acknowledge primarily the patients and the families and friends who supported them in participating in these studies. Also to NIHR Cambridge BRC for their support for staff and infrastructure costs; The Cancer Research UK Cambridge Centre Cambridge for their support for staff and infrastructure costs; Cancer Molecular Diagnostics Laboratory for their support for sample collection; The Precision Breast Cancer Institute Team for their support for sample collection; The Cambridge Clinical Trials Centre—Cancer Theme for their support with core staff support; The clinical trials support staff at all participating sites.

Author information

These authors contributed equally: Shadi Basyuni, Laura Heskin.
A full list of members and their affiliations appears in the Supplementary Information.

Authors and Affiliations

Department of Medical Genetics, National Institute for Health Research Cambridge Biomedical Research Centre, University of Cambridge, Cambridge, UK
Shadi Basyuni, Laura Heskin, Andrea Degasperi, Daniella Black, Gene C. C. Koh, Lucia Chmelova, Giuseppe Rinaldi, Yasin Memari, Marc Tischkowitz, Helen R. Davies & Serena Nik-Zainal
Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
Shadi Basyuni, Laura Heskin, Andrea Degasperi, Daniella Black, Gene C. C. Koh, Lucia Chmelova, Giuseppe Rinaldi, Yasin Memari, Helen R. Davies & Serena Nik-Zainal
Precision Breast Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
Steven Bell & Jean Abraham
Cancer Research UK Cambridge Centre, University of Cambridge, Li Ka Shing Centre, Cambridge, UK
Steven Bell, Carlos Caldas & Jean Abraham
Cambridge Cancer Trials Centre, Department of Oncology, Cambridge University Hospitals, Cambridge, UK
Louise Grybowicz
Genomics England, One Canada Square, London, UK
Greg Elgar
Department of Oncology, University of Oxford, Oxford, UK
Pauline Robbe, Anna Schuh & Matthew A. Brown
RIKEN Centre for Integrative Medical Sciences, Yokohama, Japan
Pauline Robbe
Illumina Cambridge, Ltd, Illumina Centre, Cambridge, UK
Zoya Kingsbury
Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London, UK
Louise Jones

Authors

Shadi Basyuni
View author publications
You can also search for this author in PubMed Google Scholar
Laura Heskin
View author publications
You can also search for this author in PubMed Google Scholar
Andrea Degasperi
View author publications
You can also search for this author in PubMed Google Scholar
Daniella Black
View author publications
You can also search for this author in PubMed Google Scholar
Gene C. C. Koh
View author publications
You can also search for this author in PubMed Google Scholar
Lucia Chmelova
View author publications
You can also search for this author in PubMed Google Scholar
Giuseppe Rinaldi
View author publications
You can also search for this author in PubMed Google Scholar
Steven Bell
View author publications
You can also search for this author in PubMed Google Scholar
Louise Grybowicz
View author publications
You can also search for this author in PubMed Google Scholar
Greg Elgar
View author publications
You can also search for this author in PubMed Google Scholar
Yasin Memari
View author publications
You can also search for this author in PubMed Google Scholar
Pauline Robbe
View author publications
You can also search for this author in PubMed Google Scholar
Zoya Kingsbury
View author publications
You can also search for this author in PubMed Google Scholar
Carlos Caldas
View author publications
You can also search for this author in PubMed Google Scholar
Jean Abraham
View author publications
You can also search for this author in PubMed Google Scholar
Anna Schuh
View author publications
You can also search for this author in PubMed Google Scholar
Louise Jones
View author publications
You can also search for this author in PubMed Google Scholar
Marc Tischkowitz
View author publications
You can also search for this author in PubMed Google Scholar
Matthew A. Brown
View author publications
You can also search for this author in PubMed Google Scholar
Helen R. Davies
View author publications
You can also search for this author in PubMed Google Scholar
Serena Nik-Zainal
View author publications
You can also search for this author in PubMed Google Scholar

Consortia

PARTNER Trial Group

Jean Abraham

Personalised Breast Cancer Program Group

Jean Abraham
, Steven Bell
& Carlos Caldas

Contributions

Study conception, design, or data acquisition: S.N.-Z., S.Bas., L.H., G.E., P.R., C.C., J.A., A.S., L.J., M.T., M.B., L.G. Data curation, analysis and/or interpretation: S.Bas., L.H., A.D., G.C.C.K., S.Bel., Y.M., Z.K., H.R.D., G.R., L.C., D.B. Visualisation: S.Bas, L.H. Manuscript drafting: S.Bas, L.H., S.N.-Z. All authors were involved in manuscript review and editing. Manuscript critical revision: S.Bas, S.N.-Z.

Corresponding author

Correspondence to Serena Nik-Zainal.

Ethics declarations

Competing interests

A.D., H.R.D., G.C.C.K., G.R., and S.N.-Z. hold patents or have submitted applications on clinical algorithms of mutational signatures: MMRDetect (PCT/EP2022/057387), HRDetect (PCT/EP2017/060294), clinical use of signatures (PCT/EP2017/060289), rearrangement signature methods (PCT/EP2017/060279), clinical predictor (PCT/EP2017/060298), and hotspots for chromosomal rearrangements (PCT/EP2017/060298). Z.K. is an employee of Illumina, Inc. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Hidewaki Nakagawaand the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary Dataset 1

Supplementary Dataset 2

Supplementary Dataset 3

Supplementary Dataset 4

Supplementary Dataset 5

Supplementary Dataset 6

Supplementary Dataset 7

Supplementary Dataset 8

Supplementary Dataset 9

Supplementary Dataset 10

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Basyuni, S., Heskin, L., Degasperi, A. et al. Large-scale analysis of whole genome sequencing data from formalin-fixed paraffin-embedded cancer specimens demonstrates preservation of clinical utility. Nat Commun 15, 7731 (2024). https://doi.org/10.1038/s41467-024-51577-2

Download citation

Received: 03 April 2024
Accepted: 07 August 2024
Published: 04 September 2024
DOI: https://doi.org/10.1038/s41467-024-51577-2
Springer Nature Limited

Large-scale analysis of whole genome sequencing data from formalin-fixed paraffin-embedded cancer specimens demonstrates preservation of clinical utility

Abstract

Similar content being viewed by others

Introduction

Results

Quality of sequencing data

Drivers and actionable variants

Mutational signature analysis defines FFPE-related signatures

Analysis of validation cohorts

Discussion

Methods

Study design and participants

Procedures

Statistical analysis

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Consortia

PARTNER Trial Group

Personalised Breast Cancer Program Group

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation