Main

A central deficiency in our knowledge of cancer concerns how genomic changes drive the proteome and phosphoproteome to execute phenotypic characteristics1,2,3,4. The initial proteomic characterization in The Cancer Genome Atlas (TCGA) breast cancer study was performed using reverse phase protein arrays (RPPA); however, this approach is restricted by antibody availability. To provide greater analytical breadth, the NCI Clinical Proteomic Tumor Analysis Consortium (CPTAC) is using mass spectrometry to analyse the proteomes of genome-annotated TCGA tumour samples5,6. Here we describe integrated proteogenomic analyses of TCGA breast cancer samples representing the four principal mRNA-defined breast cancer intrinsic subtypes7,8.

Proteogenomic analysis of TCGA samples

105 breast tumours previously characterized by the TCGA were selected for proteomic analysis after histopathological documentation (Supplementary Tables 1 and 2). The cohort included a balanced representation of PAM50-defined intrinsic subtypes9 including 25 basal-like, 29 luminal A, 33 luminal B, and 18 HER2 (ERBB2)-enriched tumours, along with 3 normal breast tissue samples. Samples were analysed by high-resolution accurate-mass tandem mass spectrometry (MS/MS) that included extensive peptide fractionation and phosphopeptide enrichment (Extended Data Fig. 1a). An isobaric peptide labelling approach (iTRAQ) was employed to quantify protein and phosphosite levels across samples, with 37 iTRAQ 4-plexes analysed in total. A total of 15,369 proteins (12,405 genes) and 62,679 phosphosites were confidently identified with 11,632 proteins per tumour and 26,310 phosphosites per tumour on average (Supplementary Tables 3, 4 and Supplementary Methods). After filtering for observation in at least a quarter of the samples (Supplementary Methods, Extended Data Fig. 1b), 12,553 proteins (10,062 genes) and 33,239 phosphosites, with their relative abundances quantified across tumours, were used in subsequent analyses in this study. Stable longitudinal performance and low technical noise were demonstrated by repeated interspersed analyses of a single batch of patient-derived luminal and basal breast cancer xenograft samples10 (Extended Data Fig. 1d, e). Owing to the heterogeneous nature of breast tumours11,12,13, and because proteomic analyses were performed on tumour fragments that were different from those used in the genomic analyses, rigorous pre-specified sample and data quality control metrics were implemented14,15 (Supplementary Discussion and Extended Data Figs 2, 3). Extensive analyses concluded that 28 of the 105 samples were compromised by protein degradation. 
These samples were excluded from further analysis with subsequent informatics focused on the 77 tumour samples and three biological replicates.
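The observation filter mentioned above (features retained only if quantified in at least a quarter of the samples) can be illustrated with a minimal Python sketch. The 25% cutoff follows the text; the table layout, the use of None for missing values, and the names filter_by_observation, PROT_A, PROT_B and PROT_C are assumptions made for illustration and do not reproduce the procedure in the Supplementary Methods.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[float]]]


def filter_by_observation(table: Table, min_fraction: float = 0.25) -> Table:
    """Keep features quantified (not None) in at least min_fraction of samples."""
    kept: Table = {}
    for feature, values in table.items():
        observed = sum(v is not None for v in values)
        if values and observed / len(values) >= min_fraction:
            kept[feature] = values
    return kept


# Toy table of relative abundances: hypothetical proteins across 8 samples.
toy = {
    "PROT_A": [0.1, -0.3, None, 0.5, 0.2, None, 0.0, 0.4],      # 6/8 observed
    "PROT_B": [None, None, None, None, None, None, 1.2, None],  # 1/8 observed
    "PROT_C": [0.7, None, None, None, None, None, -0.2, None],  # 2/8 observed
}
filtered = filter_by_observation(toy)
```

In this toy table PROT_B is dropped because it is observed in only one of eight samples, whereas PROT_C sits exactly at the cutoff and is retained.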

Genome and transcriptomic variation was observed at the peptide level by searching MS/MS spectra not matched to RefSeq against a patient-specific sequence database (Fig. 1a). The database was constructed using the QUILTS software package16, leveraging RefSeq gene models based on whole-exome and RNA-seq data generated from portions of the same tumours and matched germline DNA (Fig. 1a, Supplementary Table 5). Although these analyses detected a number of single amino acid variants, frameshifts, and splice junctions, including splice isoforms that had been detected as only single transcript reads by RNA-seq (Fig. 1b, Supplementary Table 5), the number of genomic and transcriptomic variants that were confirmed as peptides by MS/MS was low (Supplementary Discussion). Sparse detection of individual genomic variants by peptide sequencing has been noted in our previous studies16 and reflects limited coverage at the single amino acid level with current technology. However, quantitative MS/MS analysis of multiple peptides for each protein is used to reliably infer overall protein levels. This is an advantage of MS/MS, as antibody-based protein expression analysis is typically based on a single epitope. To illustrate this capability in the current data set, an initial analysis of three frequently mutated genes in breast cancer (TP53, PIK3CA, and GATA3) and three clinical biomarkers (oestrogen receptor (ER; ESR1), progesterone receptor (PGR), and ERBB2) was conducted (Fig. 1c, Supplementary Table 6, 7 and Supplementary Discussion). As expected, TP53 missense mutations were associated with elevated MS/MS-based protein levels, as observed by RPPA, especially in basal-like breast cancer. TP53 nonsense and frameshift mutations were associated with a decrease in TP53 protein levels that was particularly pronounced in the MS/MS data. 
In contrast, the mostly C-terminal GATA3 frameshift alterations did not result in decreased protein expression when measured by the median of all GATA3 peptides, suggesting that these proteins are expressed despite truncation. No consistent effect of somatic PIK3CA mutation was observed at the level of protein expression. Good Pearson correlations between RNA-seq and MS/MS protein-expression levels were found for ESR1 (r = 0.74), PGR (r = 0.74), ERBB2 (r = 0.84) and GATA3 (r = 0.83), with moderate correlations observed for PIK3CA (r = 0.45) and TP53 (r = 0.36). Lower TP53 protein abundance levels compared to mRNA levels were especially prevalent in luminal tumours, suggesting post-transcriptional regulatory mechanisms such as proteasomal degradation. To explore this hypothesis, a search was made for E3 ligases that showed negative correlation to p53 protein (Supplementary Table 8). These analyses identified UBE3A (r = −0.42; adjusted P value = 0.05) (Extended Data Fig. 4a), an established TP53 E3 ligase17. In comparing copy number alterations (CNAs), RNA, and protein levels for GATA3, copy number gains in chromosome 10q were anticorrelated with RNA and protein levels in basal-like tumours. This observation prompted a search for other gains or losses that were anticorrelated with RNA and/or protein levels (see Extended Data Fig. 4b for further analyses). Overall, six genes were identified that significantly anticorrelated at a false discovery rate (FDR) <0.05 on both RNA and protein levels to their CNA signals (Extended Data Fig. 4b). GATA3 amplification on 10q in basal-like breast cancer showed the strongest anticorrelation, followed by the hexosamine and glycolysis pathway enzymes GFPT2 and HK3, which are upregulated in basal-like breast cancer despite being subjected to frequent chromosomal deletion on 5q. 
Global analysis of the correlation of mRNA-to-protein yielded a median Pearson value of r = 0.39, with 6,135 out of 9,302 mRNA–protein pairs (66.0%) correlating significantly at an FDR <0.05 (Extended Data Fig. 4c, Supplementary Table 9 and Supplementary Discussion). Similar to a previous colon cancer analysis6, metabolic functions such as amino acid, sugar and fatty acid metabolism were found to be enriched among positively correlated genes18 whereas ribosomal, RNA polymerase and mRNA splicing functions were negatively correlated. Overall these analyses demonstrate the utility of global proteome correlation analysis for both confirmation of suspected regulatory mechanisms and identification of candidate regulators meriting further investigation.
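The two statistical ingredients of the global mRNA-to-protein comparison, a per-gene Pearson correlation and Benjamini–Hochberg control of the FDR, can be sketched as follows. This is a standard-library illustration under stated assumptions: the function names and toy numbers are hypothetical, and the P values are taken as given rather than derived from the correlations.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg-adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: one gene's mRNA and protein levels across four samples,
# and four hypothetical per-gene P values.
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.05 for p in adj]
```

Pairs whose adjusted P value falls below 0.05 would be counted as significantly correlated in this scheme.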

Figure 1: Direct effects of genomic alterations on protein level.

a, b, Overlap of protein-coding single amino acid variants (a) and RNA splice junctions (b) not present in RefSeq v60 detected by DNA exome sequencing, RNA-seq, and LC–MS/MS. Proportions of novel variants are noted. c, Heat map of mutations/CNA and their effects on RNA and protein expression of breast-cancer-relevant genes across tumour and normal samples. ER, PR, HER2 and PAM50 status are annotated. Median iTRAQ protein abundance ratio and the most frequently detected and differential phosphosite ratio are shown for each gene. Pearson correlations between MS/MS protein and RNA-seq, and MS/MS protein and RPPA are indicated.

Copy number alterations

To determine the consequences of CNAs on mRNA, protein, and phosphoprotein abundance, both in ‘cis’ on genes within the aberrant locus and in ‘trans’ on genes encoded elsewhere, univariate correlation analysis was used as previously described6. A total of 7,776 genes with CNA, mRNA and protein measurements were analysed by calculating Pearson correlation and associated statistical significance (Benjamini–Hochberg-corrected P value) for all possible CNA–mRNA and CNA–protein pairs (Fig. 2a, Supplementary Table 10, Extended Data Fig. 5a, see Methods). For the phosphoproteome, 4,472 CNA–phosphoprotein pairs were analysed (Extended Data Fig. 5b). Significant positive correlations (cis) were observed for 64% of all CNA–mRNA, 31% of all CNA–protein, and 20% of all CNA–phosphoprotein pairs (Fig. 2b). Proteins and phosphoproteins correlated in cis to CNAs were, for the most part, a subset of the cis-effects observed in mRNA–CNA correlation (Fig. 2b, Supplementary Table 10). The fractional difference of well-annotated oncogenes and tumour suppressor genes among the significantly cis-correlated CNA–mRNA and CNA–protein gene pairs was analysed. On the basis of a reference list of 487 oncogenes and tumour suppressors (Supplementary Table 10), these cancer-relevant genes occur 37.6% more frequently in the subset of genes that correlate both on CNA–mRNA and CNA–protein levels than in the subset that only correlate on CNA–mRNA but not on CNA–protein levels (Fisher exact P value = 0.02). This suggests that CNA events with a tumour-promoting outcome are more likely to lead to cis-regulatory effects on both the protein and mRNA level, whereas CNA events with no documented role in tumorigenesis are more likely to be neutralized on the protein level than on the RNA level. Trans-effects (Fig. 2a) appear as vertical bands, with accompanying frequency histograms (in blue) highlighting ‘hot spots’ of significant trans-effects. Using a minimum threshold of 50 trans-affected genes, 68% of the tested genes were associated with trans-effects on the mRNA level, whereas only 13% were associated with effects on the protein level and 8% on the phosphoprotein level. Importantly, CNA–protein correlations appeared to be a reduced representation of CNA–mRNA correlations. Furthermore, for many CNA regions, correlations were more directionally uniform on the protein level than on the mRNA level. CNA regions exhibiting the most trans-associations at the protein level were found on chromosomes 5q (loss of heterozygosity (LOH) in basal; gain in luminal B), 10p (gain in basal), 12 (gain in basal), 16q (luminal A deletion), 17q (luminal B amplification), and 22q (LOH in luminal and basal) (Extended Data Fig. 5a).
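The distinction between cis- and trans-effects, and the minimum threshold of 50 trans-affected genes, can be made concrete with a short Python sketch. It assumes that significance testing has already produced a list of significant (CNA gene, affected gene) pairs; the function name and the placeholder gene labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def summarize_cna_effects(
    significant_pairs: Iterable[Tuple[str, str]], min_trans: int = 50
) -> Tuple[Set[str], Set[str]]:
    """Split significant pairs into cis genes and trans 'hot spot' CNA genes.

    A pair is cis when the CNA gene and the affected gene are the same,
    and trans otherwise. A CNA gene is a hot spot when it has at least
    min_trans significant trans-affected genes.
    """
    cis_genes: Set[str] = set()
    trans_counts: Counter = Counter()
    for cna_gene, target in significant_pairs:
        if cna_gene == target:
            cis_genes.add(cna_gene)
        else:
            trans_counts[cna_gene] += 1
    hotspots = {g for g, n in trans_counts.items() if n >= min_trans}
    return cis_genes, hotspots


# Toy input: GENE_A has a cis-effect and 50 trans targets,
# GENE_B has a cis-effect and a single trans target.
pairs = [("GENE_A", "GENE_A"), ("GENE_B", "GENE_B"), ("GENE_B", "TARGET_1")]
pairs += [("GENE_A", f"TARGET_{i}") for i in range(50)]
cis, hot = summarize_cna_effects(pairs)
```

In the plots described above, the cis set corresponds to the diagonal and the hot spots to the vertical bands.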

Figure 2: Effects of CNAs on mRNA, protein, and phosphoprotein abundance.

a, Correlations of CNA (x axes) to RNA and protein expression levels (y axes) highlight new CNA cis- and trans-effects. Significant (FDR < 0.05) positive (red) and negative (green) correlations between CNA and mRNAs or proteins are indicated. CNA cis-effects appear as a red diagonal line, CNA trans-effects as vertical stripes. Histograms show the fraction (%) of significant CNA trans-effects for each CNA gene. b, Overlap of cis-effects observed at RNA, protein, and phosphoprotein levels (FDR < 0.05). c, Trans-effect regulatory candidates identified among those with significant protein cis-effects using LINCS CMap. Bars indicate total numbers of significant CNA–protein trans-effects (grey; FDR < 0.05) and overlap with regulated genes in LINCS knockdown profiles (red; 4 cell lines; moderated t-test FDR < 0.1).

Trans-associations are not necessarily direct consequences of the chromosomal aberration. For example, as 5q loss occurs in at least 50% of basal-like breast cancers19, many of the trans-effects involve genes that mark the basal subtype. To identify candidate driver genes with copy number alterations that are direct drivers of trans-effects, results were compared with functional knockdown data on 3,797 genes in the Library of Integrated Network-based Cellular Signatures (LINCS) database (http://www.lincsproject.org/)20,21,22. For any given gene with copy number alterations (‘CNA-gene’), sets of genes were identified corresponding to proteins that changed where there was gain (‘CNA-gain trans-gene set’) or loss (‘CNA-loss trans-gene set’). These gene sets were then compared to the effects of gene knockdown in the LINCS database (see Supplementary Methods). Queries for 502 different CNA genes meeting the criteria defined above identified 10 CNA genes that could be functionally connected to both CNA-gain and CNA-loss trans-protein-level effects (Extended Data Fig. 5c, Supplementary Table 11). A permutation-based approach implemented to test significance (see Supplementary Methods) yielded an FDR <0.05 for 10 genes affected by both CNA gains and losses (Fig. 2c). These proteins were defined as potential regulatory candidates for the CNA trans-effects observed on the proteome level in this study, as in a gene-dependent manner an average of 17% of these trans-effects were consistent with the knockdown profiles. Notably, the established oncogenic receptor tyrosine kinase ERBB2 was functionally connected only to CNA gain trans-effects (Supplementary Table 11). The E3 ligase SKP1 (ref. 23) and the ribonucleoprotein export factor CETN3, both located on chromosome arm 5q with frequent losses in basal-like breast cancer and less frequent gains in luminal B breast cancer, were detected as potential regulators affecting the expression of the tyrosine kinase and therapeutic target EGFR, and SKP1 also was linked to SRC (Extended Data Fig. 5d). Another potential regulator, FBXO7 (a substrate recognition component of the SCF (SKP1-CUL1-F-box protein)-type E3 ubiquitin ligase complex), was affected mostly by LOH events on chromosome 22q. Interestingly, in a recent human interaction proteome study, SKP1 and FBXO7 were listed as interaction partners24.

Clustering and network analyses

Transcriptional profiling has converged on four major breast cancer subtypes: luminal A, luminal B, basal and HER2-enriched1,9. To investigate the extent to which the PAM50 ‘intrinsic’ breast cancer classification scheme is reflected or refined on the proteome level in the CPTAC samples, clustering analyses were first restricted to the reduced set of PAM50 genes. When RNA data for the 50 PAM50 genes were clustered directly (without using a classifier), the clustering was similar to the TCGA PAM50 annotation (second annotation bar in Fig. 3a). Restricting both the RNA and proteome data to the set of 35 PAM50 genes observed in the proteome produced a similar result (bottom two annotation bars in Fig. 3a), and all of the major PAM50 groups were recapitulated in the proteome almost as well as in the RNA data. This indicates that although different tissue sections of the same tumours were used for RNA-seq and protein analysis, very similar subtype-defining features can be observed in both data types. Global proteome and phosphoproteome data were then used to identify proteome subtypes in an unsupervised manner. Consensus clustering identified basal-enriched, luminal-enriched, and stromal-enriched clusters (Extended Data Figs 6a–d, 7a). Unlike the clustering observed with PAM50 genes, mRNA-defined HER2-enriched tumours were distributed across these three proteomic subgroups. The basal-enriched and luminal-enriched groups showed a strong overlap with the mRNA-based PAM50 basal-like and luminal subgroups, whereas the stromal-enriched proteome subtype represented a mix of all PAM50 mRNA-based subtypes and had a significantly enriched stromal signature (Extended Data Fig. 3e). Among the stromal-enriched tumours there was strong representation of reactive type I tumours, as classified by RPPA (Supplementary Table 12), showing agreement between the RPPA and mass-spectrometry-based protein analyses for the detection of a tumour subgroup characterized by stromal gene expression1.

Figure 3: Proteomic and phosphoproteomic subtypes of breast cancer and subtype-specific pathway enrichment.

a, Unsupervised clustering of RNA-seq and proteomics data restricted to PAM50 genes and the subset of 35 detected proteins reveals high similarity to PAM50 (TCGA) sample annotation. b, K-means consensus clustering of proteome and phosphoproteome data identifies basal-enriched, luminal-enriched, and stromal-enriched subgroups. c, GSEA highlights sets of pathways significantly differential between basal-enriched and luminal-enriched tumours (detailed in Extended Data Fig. 7b). ReacI and ReacII, reactive type I and II, respectively. d, K-means consensus clustering performed on pathways derived from single sample GSEA analysis of phosphopeptide data identifies four distinct clusters.

As the basal- and luminal-enriched proteome subgroups are coherent, pathway analyses were conducted on these two subtypes, using the stromal-enriched subgroup as a control to assess specificity (Fig. 3c, Extended Data Fig. 7b, Supplementary Table 13). The luminal-enriched subgroup was exclusively enriched for oestradiol- and ESR1-driven gene sets. In contrast, multiple gene sets were enriched and upregulated specifically in the basal-like tumours. Particularly extensive basal-like enrichment was seen for MYC target genes; for cell cycle, checkpoint, and DNA repair pathways including regulators AURKA/B, ATM, ATR, CHEK1/2, and BRCA1/2; and for immune response/inflammation, including T-cell, B-cell, and neutrophil signatures. The complementarity of transcriptional, proteomic, and phosphoproteomic data was also highlighted in these analyses (Extended Data Fig. 7c, d).

Using phosphorylation status as a proxy for activity, phosphoproteome profiling can theoretically be used to develop a signalling-pathway-based cancer classification. K-means consensus clustering was therefore performed on pathways derived from single sample gene set enrichment analysis (GSEA) of phosphopeptide data (Methods, Supplementary Tables 14 and 15). Of four robustly segregated groups, subgroups 2 and 3 substantially recapitulated the stromal- and luminal-enriched proteomic subgroups, respectively (Fig. 3d, Extended Data Fig. 8a). Subgroup 4 included a majority of tumours from the basal-enriched proteomic subgroup, but was admixed particularly with luminal-enriched samples. This subgroup was defined by high levels of cell cycle and checkpoint activity. All basal and a majority of non-basal samples in this subgroup had TP53 mutations. Consistent with high levels of cell cycle activity, a multivariate kinase–phosphosite abundance regression analysis highlighted CDK1 as one of the most highly connected kinases in this study (Extended Data Fig. 8b, Supplementary Table 16). Subgroup 1 was a novel subgroup defined exclusively in the phosphoproteome pathway activity domain, with no enrichment for either proteomic or PAM50 subtypes. It was defined by G protein, G-protein-coupled receptor, and inositol phosphate metabolism signatures, as well as ionotropic glutamate signalling (Fig. 3d). Co-expression patterns among genes/proteins across different subgroups were also analysed using a Joint Random Forest method25 that identified network modules, such as an MMP9 module, with different interaction patterns between basal-enriched and luminal-enriched subgroups. These latter patterns appeared specific to the proteome-level data (Extended Data Fig. 8c–f, Supplementary Table 17 and Supplementary Methods).

Phosphosite markers in PIK3CA- and TP53-mutated tumours

TP53 and PIK3CA are the most recurrently mutated genes in breast cancer, with frequencies for PIK3CA at 43% in luminal tumours and for TP53 at 84% in basal-like tumours1. Most of the PIK3CA missense mutations were gain of function mutations and therefore were expected to lead to activation of the PI3K signalling cascade, but the extent to which this occurs has been controversial and it is unclear which pathway components are effectors26,27. Marker selection analysis was therefore performed for upregulated phosphosites in PIK3CA-mutated tumours. In total, 62 phosphosites were identified that were positively associated with PIK3CA mutation (FDR <0.05), including the kinases RPS6KA5 and EIF2AK4 (Extended Data Fig. 9a, Supplementary Table 18). Calculating the average phosphorylation signal of these marker phosphosites provided a read-out for PI3K pathway activity in PIK3CA-mutated tumours, with 15 of the 26 mutated tumours (58%) exhibiting an activated PIK3CA mutation signature. Of note, the identified PIK3CA mutant phosphoproteome signature was activated in all tumours harbouring helical domain PIK3CA mutations, but only 2 of 10 tumours harbouring kinase domain mutations. To test if the identified differences in the phosphoproteome of PI3K mutant versus wild-type tumours could be explained by mutation of PIK3CA, the tumour data were compared to phosphosite signatures derived from isogenic PIK3CA mutant cell lines28 (Extended Data Fig. 9b, Supplementary Table 18). There was an enrichment of signatures derived from helical-domain-mutated isogenic cell lines, but not from kinase-domain-mutated cells, supporting the observations in primary tumours.

The same strategy was used to identify phosphorylation signalling events connected to TP53 mutation. A total of 56 phosphosites upregulated in TP53-mutated tumours were identified that were independent of basal-like subtype association (Extended Data Fig. 9c, Supplementary Table 18). Using the average phosphorylation signal of these marker phosphosites as a proxy for TP53-mutation-driven cell cycle control, 22 of 41 mutated tumours (54%) showed upregulated signals. This TP53 mutant phosphosignature was somewhat enhanced in tumours in which mutations occurred almost exclusively in the DNA-binding region compared to those with nonsense/frameshift mutations. In addition to the well-described checkpoint kinase CHEK2, significantly upregulated phosphosites were identified for the kinases MASTL and EEF2K in TP53-mutated tumours. Single-sample GSEA analysis of isogenic p53-mutant phosphosignatures showed an enrichment of a phosphosignature derived from R273H-mutated isogenic cells (Extended Data Fig. 9d), confirming the pronounced effect of missense mutations in the DNA-binding region on phosphorylation pathways.
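The read-out used for both the PIK3CA and TP53 signatures, the average phosphorylation signal across a set of marker phosphosites, can be sketched in Python as follows. The nested-dictionary layout, the site and tumour labels, and in particular the cutoff of zero used to call a signature "activated" are assumptions of this sketch; the text does not state the threshold applied in the study.

```python
from statistics import mean
from typing import Dict, Iterable

Phospho = Dict[str, Dict[str, float]]  # site -> sample -> relative abundance


def signature_scores(phospho: Phospho, markers: Iterable[str]) -> Dict[str, float]:
    """Average the marker-site signal per sample, skipping missing values."""
    markers = [site for site in markers if site in phospho]
    samples = set()
    for site in markers:
        samples.update(phospho[site])
    return {
        sample: mean(phospho[site][sample] for site in markers if sample in phospho[site])
        for sample in samples
    }


# Toy data: two hypothetical marker sites measured in two tumours.
toy_phospho = {
    "SITE_1": {"TUMOUR_1": 1.0, "TUMOUR_2": -1.0},
    "SITE_2": {"TUMOUR_1": 2.0, "TUMOUR_2": 0.0},
    "SITE_3": {"TUMOUR_1": -5.0, "TUMOUR_2": 5.0},  # not a marker, ignored
}
scores = signature_scores(toy_phospho, ["SITE_1", "SITE_2"])
# Hypothetical cutoff: a score above 0 is called "activated".
activated = sorted(s for s, v in scores.items() if v > 0)
```

Averaging over many sites makes the score less sensitive to any single missing or noisy phosphosite measurement, which is one reason a composite read-out is attractive here.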

Kinase gene amplification and subtype-specific activation

CNAs span many driver gene candidates and RNA expression has been frequently used to narrow candidate nominations. Proteogenomic analysis should further promote this nomination process. In candidate refinement, a focus on protein kinases is warranted, as many are drug targets. An in-depth proteogenomic pipeline was developed that flagged kinases whose expression levels were at least 1.5 interquartile ranges higher than the median (Supplementary Table 19). A proteogenomic circos-like29 plot (termed a ‘pircos’ plot) was used to map these outlier values onto the genome (Fig. 4a, b, Extended Data Fig. 10a). The ERBB2 locus showed the strongest effect of increased phosphoprotein levels associated with gene-amplification-driven RNA and protein overexpression (Fig. 4a). The kinase CDK12 is a positive transcriptional regulator of homologous recombination repair genes with its partner cyclin K30, and is often encompassed by the ERBB2 amplicon. This gene was also found to be upregulated at the RNA, protein, and phosphosite level, indicating that CDK12 is highly active in the majority of ERBB2-positive tumours (Fig. 4a). The analysis of the ERBB2 amplicon also uncovered co-outlier phosphorylation status for MED1, GRB7, MSL1, CASC3 and TOP2A, all previously described in association with ERBB2 amplification. To better understand the downstream effects of ERBB2 amplification, additional phosphosite outliers were identified in 41 known ERBB2 signalling genes for the 15 samples that had ERBB2 phosphosite outlier expression (Extended Data Fig. 10b).
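The outlier criterion stated above (a value at least 1.5 interquartile ranges higher than the median) can be written as a short Python sketch. The quartile convention is an assumption: the default method of statistics.quantiles is used here, whereas the pipeline behind Supplementary Table 19 may define quartiles differently. Sample labels and values are invented.

```python
from statistics import median, quantiles
from typing import Dict, List


def flag_outliers(values: Dict[str, float], k: float = 1.5) -> List[str]:
    """Return samples whose value is at least median + k * IQR."""
    data = list(values.values())
    q1, _, q3 = quantiles(data, n=4)  # quartiles, default 'exclusive' method
    cutoff = median(data) + k * (q3 - q1)
    return sorted(sample for sample, v in values.items() if v >= cutoff)


# Toy kinase measurements across eight hypothetical samples.
kinase_levels = {
    "S1": 0.0, "S2": 0.1, "S3": -0.1, "S4": 0.2,
    "S5": -0.2, "S6": 0.05, "S7": -0.05, "S8": 3.0,
}
outliers = flag_outliers(kinase_levels)
```

With these toy values only S8 exceeds the cutoff. Applying the same rule separately to copy number, RNA, protein and phosphosite data gives the per-level outlier calls that a plot of this kind overlays.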

Figure 4: Example analyses of aberrantly regulated kinases in human breast cancer.

a, b, pircos (proteogenomics circos) plots showing CNA, RNA, protein, and phosphosite expression for 17 tumours with amplification in 17q (ERBB2 CNA > 1) and 8 tumours with amplification in 11q (PAK1 CNA > 1). Labelled genes have CNA > 1 and phosphosite > 1. c, Proteogenomic outlier expression analysis for ERBB2, CDK12, and PAK1. Samples with outlier phosphosite (red), protein (yellow), RNA (green) and copy number (purple) expression are shown. Phosphosite squares indicate per-sample outlier phosphosites. d, Outlier kinase events by PAM50 subtype (>35% of subtype samples contain a phosphosite outlier; <10% FDR using Benjamini–Hochberg-adjusted P values).

These canonical findings stimulated a proteogenomic analysis to identify additional outlier kinases in the breast cancer genome. A proteogenomic dissection of chromosome 11q based on PAK1 amplification (Fig. 4b, c), a breast cancer driver kinase31, illustrated that PAK1 is hyperphosphorylated in PAK1-amplified tumours, along with CLNS1A, RSF1 and GAB2 (ref. 32). Additional examples of outlier kinases included PTK2 and RIPK2 in association with amplification of chromosome 8q (Fig. 4c, Extended Data Fig. 10a, c). PAK1 and TLK2 (17q23) appear to be luminal-breast-cancer-specific events (Fig. 4c, Extended Data Fig. 10c). To further examine whether outlier kinases were breast cancer subtype-specific independent of amplification status, the Benjamini–Hochberg-corrected probability was calculated of finding the number of phosphosite outliers within a subtype, given the total number of outliers across all subtypes, the subtype sample size and the total sample size (Fig. 4d). These analyses led to the expected identification of ERBB2 in the HER2-enriched subtype at the 5% FDR level, as well as the new finding of CDC42BPG (MRCKγ), an effector kinase for RHO-family GTPases33. In basal-like breast cancer, two kinases, PRKDC and SPEG, were significant at the 5% FDR level. PRKDC is a non-homologous end-joining factor that can be phosphorylated by ATM kinase, and is therefore a logical finding in this disease subset34. However, SPEG, a kinase associated with severe dilated cardiomyopathy when suppressed35, has not been previously reported in association with breast cancer. A larger number of subtype-specific kinases were detected at the 10% FDR level, several of which have recently described relevance in breast cancer, including PRKD3 in basal-like breast cancer36, the LKB-regulated SIK3 in luminal A breast cancer37 and CDK13 in luminal B breast cancer, which, similar to CDK12, can interact with cyclin K30.
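One way to compute a probability of the kind described above, from the number of outliers within a subtype, the total number of outliers, the subtype sample size and the total sample size, is an upper-tail hypergeometric test. The text does not name the distribution used, so the hypergeometric model is an assumption of this Python sketch, as are the function name and the toy counts. Benjamini–Hochberg correction across kinases and subtypes would be applied afterwards.

```python
from math import comb


def outlier_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    N: total number of samples
    K: samples with an outlier across all subtypes
    n: samples in the subtype of interest
    k: samples with an outlier within that subtype
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Toy counts: 10 samples, 3 with an outlier, a subtype of 4 samples
# that contains all 3 outliers.
p_value = outlier_enrichment_p(k=3, n=4, K=3, N=10)
```

For these toy counts the probability is 7/210, or about 0.033, since only 7 of the 210 possible four-sample subtypes would contain all three outlier samples.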

Discussion

The breadth and depth of proteomic and phosphoproteomic analyses displayed in this study demonstrate the strength of mass-spectrometry-based proteomics, but also some of the limitations inherent in proteolytic peptide sequencing (see Supplementary Discussion). An example of how high-dimensional proteomic analysis provides insight into unresolved genomic issues concerns the study of loss of the long arm of chromosome 5 (5q). Analysis of RNA and protein correlations narrowed the list of potential trans-deregulated proteins. Orthogonal candidate screening using functional genomics methodologies identified loss of CETN3 and SKP1 as potential trans-regulators, with upregulation of EGFR as a downstream consequence in basal-like breast cancers. Although further experimental evidence must be sought for these proposed regulatory relationships, the SKP1–Cullin complex has already been linked to EGFR activation in glioma38. Unfortunately, EGFR targeting has not proven to be an effective therapy in basal-like breast cancer to date39. This might be because SKP1 loss deregulates multiple targets, therefore mandating a much broader inhibitory strategy.

It is recognized that PIK3CA mutations do not strongly activate canonical downstream effectors28. Mass-spectrometry-based phosphoproteomics provides an opportunity for unbiased examination of downstream signalling events dependent on PIK3CA mutational activation. These studies revealed that common PIK3CA mutations affect a large number of targets with diverse functionalities including the kinases RPS6KA5 and EIF2AK4. Thus, the data and analyses reported here extend our knowledge of the effectors that promote tumorigenesis in response to constitutive activation of PI3 kinase. Similarly, TP53-mutation-associated phosphopeptides point towards novel functionalities, including regulation of the kinases MASTL and EEF2K.

A central goal in breast cancer research has been the identification of druggable kinases beyond HER2. Candidate genes that exhibited similar gene-amplification-driven proteogenomic patterns to ERBB2 included CDK12, TLK2, PAK1 and RIPK2. The proteogenomic link with gene amplification was particularly strong for CDK12, in keeping with its location in the ERBB2 amplicon, whereas the strengths of correlation between DNA amplification, RNA, protein, and phosphoprotein for the other examples were more variable. The presence of activated CDK12 in the ERBB2 amplicon might explain why tumours arising in BRCA1 carriers are usually ERBB2-negative. As a positive transcriptional regulator of BRCA1 and multiple FANC family members, CDK12 promotes DNA repair by homologous recombination. CDK12 amplification would, therefore, oppose the functional effects of BRCA1 haploinsufficiency during tumour evolution30. Overall, multiple outlier kinases generate testable therapeutic hypotheses for which enabling inhibitors are in development. For example, PAK1 has recently been confirmed to be a therapeutic target and poor prognosis factor in luminal breast cancer40.

Although incomplete outcome data and the remarkable heterogeneity of breast cancer are further relevant constraints, the number of TCGA specimens analysed here is insufficient to support conclusive clinical correlations. Only 8 deaths occurred among the 77 patients, which is too few to provide sufficient statistical power for association analysis. Adequately powered MS/MS-based clinical investigation will require microscaled discovery or targeted approaches41, especially given the highly limited amount of patient material available from clinical trials and the mostly formalin-fixed nature of the specimens. The current analysis is therefore centred on biological findings and correlations, with orthogonal validation and false discovery concerns addressed through an examination of cell-line databases of the effects of individual gene perturbations. Typical of a multi-tiered analysis of this complexity, there are many hypotheses to test, and many findings that require further investigation.

In conclusion, this study provides a high-quality proteomic resource for human breast cancer investigation, and illustrates technologies and analytical approaches that provide an important new opportunity to connect the genome to the proteome. Larger-scale exploration of discovery proteomics in the clinical setting will require improvements in clinical investigation, including acquisition of adequate amounts of optimally collected tumour tissue both before and during therapy as well as advances in MS/MS proteomics to reduce sample input and increase sensitivity for low abundance proteins and modified peptides.