
1.1 Introduction

Genome sequencing efforts initiated in the 1980s fostered a new paradigm in biological research: the system-wide characterization of biomolecules. Within this new paradigm, the field of proteomics, which seeks to characterize proteins on a system-wide level, emerged. Proteins, the major catalytic and structural components within all living systems, are arguably the most informative biomolecules for understanding cellular function and response to systematic perturbations, such as radiation exposure. Unfortunately, proteins are also the most challenging of all biomolecules to study on a system-wide level. In addition to cataloging and quantifying proteins within a complex biological sample, information on their post-translational modification (PTM) state, subcellular localization and interactions with other biomolecules is necessary for full proteome characterization. Adding to the challenge, proteins are dynamic, changing their abundance, PTM state, localization and interactions in response to stimuli. Gene sequences or even mRNA expression levels cannot reveal or predict this protein-level information [1, 2]. Therefore, technologies for direct analysis of proteins are necessary for proteome characterization.

Although no single technology can fully characterize all aspects of proteomes, mass spectrometry (MS) is the most powerful and flexible technology for proteomic analysis. The revolutionary discoveries in the late 1980s of Matrix-Assisted Laser Desorption/Ionization (MALDI) [3] and Electrospray Ionization (ESI) [4] made possible the analysis of intact polypeptides and proteins by MS. Along with these ionization methods, three technologies combined to provide an analytical platform underpinning the field of MS-based proteomics and enabling system-wide protein analysis. First, nanoscale reversed-phase liquid chromatography (nanoLC) coupled online with MS instruments emerged for separating peptide digests from complex protein mixtures [5]. Second, tandem mass spectrometry, commonly referred to as MS/MS, was developed for predictably fragmenting peptides, a prerequisite for determining their amino acid sequence [6]. Tandem mass spectrometry initially scans all mass-to-charge (m/z) values of peptide ions as they elute from the nanoLC column and records their signal intensities in an MS1 spectrum. Detected peptide ions are then isolated and fragmented, with the instrument undertaking another scan of all m/z values of fragment ions, recording their signal intensities in an MS2 spectrum. Third, automated sequence database searching, led by the program SEQUEST [7] and followed by Mascot [8], was developed to match large numbers of MS2 spectra to peptide sequences contained in databases and, in turn, infer the protein identities present within complex mixtures.

This basic platform for what has been termed “shotgun” or “bottom-up” proteomics offered researchers a new way forward for identifying proteins within complex mixtures. However, two problems, the extreme chemical heterogeneity and large dynamic range of protein abundance within protein mixtures derived from cells, tissues or bodily fluids, required new methods for more sensitive identification of proteins. Multidimensional liquid chromatography-based methods for fractionating peptide digests upstream of MS analysis helped address these problems, at least in part, by simplifying complex mixtures and minimizing signal suppression within the MS instrument [9–11]. These fractionation methods also overcame the limitations [12] of traditionally used two-dimensional gel electrophoresis (2DGE) for separating complex protein mixtures. Methods for enriching PTMs prior to MS analysis improved large-scale identification of proteins carrying important modifications, such as phosphorylation [13–15] or glycosylation [16]. Stable isotope labeling and dilution, traditionally used in mass spectrometric analysis of small molecules, were adapted for quantitative measurements of proteins analyzed by MS [17].

Collectively, these components of the MS-based proteomics “toolbox” fostered a new and powerful means to study proteins on a system-wide level. This enhanced platform can now routinely identify and quantify thousands of proteins, including those carrying PTMs in complex protein mixtures. Because proteins are the ubiquitous molecular “effectors” within any organism, MS-based proteomics applies to all fields of biological research, including the effects of radiation on the cellular environment.

MS-based proteomics has always been and remains a collection of dynamic technologies, with new ones constantly emerging across all facets of the platform. Continuous improvements in these technologies have moved the proteomics field closer to its ambitious goal of fully characterizing the proteins within complex biological samples at high throughput. Examples include: improvements in MS instrument sensitivity, which increase identification of low-abundance proteins; more sophisticated software programs for peptide identification from MS2 data, which enable detection of a higher proportion of the hundreds of known PTMs [18]; and higher-throughput, more quantitatively accurate methods, which make possible quantification of protein targets of interest across large numbers of individual samples, especially important for biomarker studies. However, despite continued technological improvements, the sheer complexity of biological systems greatly challenges the current platform in meeting the goal of full proteome characterization. To illustrate with a rough estimation, the human genome contains ∼25,000 genes that are processed by a variety of regulated steps (mRNA splicing, proteolysis, etc.) to produce ∼250,000 distinct proteins. These are in turn covalently modified via phosphorylation, acetylation, ubiquitination, oxidation, sumoylation, etc., to generate a proteome with millions of distinct protein-based molecules. Current proteomics technologies can still only reliably detect a fraction of these molecules. Thus there is a continued need for new and improved technologies.

Here, we provide our view on three emerging technologies in MS-based proteomics that are pushing the field in new directions: (1) New instrumental methods; (2) New computational methods for peptide identification; and (3) Label-free quantification. Figure 1.1 provides an overview of the interconnectivity of these technologies. The data produced using new instrumental methods, in particular high resolution and mass accuracy data, enable improved de novo peptide identification, which seeks to overcome the inherent limitations of currently practiced sequence database searching. Label-free quantification provides a flexible and simple way to compare samples and determine differentially abundant peptides and their inferred proteins. We review recent advances in these three technologies.

Fig. 1.1

Overview of the three, interconnected technologies reviewed in this chapter

1.2 New Instrumental Methods

From the outset, MS instrumentation has been the core technology driving proteomic advances. Fortunately, impressive improvements to the technology have continuously emerged over the last two decades. Most instrument vendors introduce a new model of any given MS instrument every 2–3 years, and those manufactured ∼10 years prior to the latest model can scarcely be considered suitable for research. Some of the most fundamental and sought-after metrics for mass spectrometers are resolution, scanning speed, and sensitivity. These are strongly interrelated: sensitivity comes at the cost of scanning speed, which in turn comes at the cost of resolution. Here we review some emerging MS instruments that are redefining what is possible in MS-based proteomic studies. We also discuss emerging methods closely linked to improving the performance of the MS instrumentation used for proteomic studies.

1.2.1 Higher Mass Accuracy and Faster Scanning Instruments

Bottom-up proteomics uses nanoLC for peptide separation coupled directly to the MS instrument. Peptides eluting from the nanoLC column are ionized via ESI and introduced into the mass spectrometer. For complex mixtures (e.g., cell or tissue lysates), the number of peptides vastly exceeds the peak capacity of the separations typically used. Michalski et al. have determined that during a typical 90-min gradient LC run of a complex proteomic mixture, a state-of-the-art mass spectrometer can detect over 100,000 peptide species [19]. Consequently, many peptide ions are introduced into the mass spectrometer simultaneously, and high resolution is required to differentiate these molecules by their m/z values. Resolution is defined as the ratio of the m/z value to the width of the peak at half its maximum; therefore, a large ratio, for example 50,000/1, is desirable. While not directly related, high mass accuracy usually accompanies high resolution. Mass accuracy is calculated via the following equation: [(actual m/z − observed m/z)/actual m/z]. Because this ratio is usually very small, it is multiplied by 10⁶ and reported in units of parts-per-million (ppm). Values of 5 ppm or less are desirable for mass accuracy. Ideally, a mass spectrometer provides sufficient mass accuracy to assign a unique elemental composition, and thus an estimation of amino acid composition, to all peptide peaks in the scanned range. Such a high level of mass accuracy greatly constrains the number of possible amino acid sequences responsible for an observed signal and reduces the incidence of false discoveries when assigning amino acid composition [20]. With 1 ppm measured mass accuracy, the amino acid composition of relatively small peptides with molecular weights in the range of 700–800 Da can be determined [21]. With the help of internal calibration techniques, achieving this level of accuracy is now almost routine [22, 23].
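
To make the ppm calculation concrete, here is a minimal sketch in Python; the m/z values are illustrative, not taken from the chapter:

```python
def ppm_error(actual_mz: float, observed_mz: float) -> float:
    """Mass accuracy in parts-per-million, per the equation above."""
    return (actual_mz - observed_mz) / actual_mz * 1e6

# Hypothetical peptide ion: theoretical m/z 785.8421, observed 785.8436.
print(f"{ppm_error(785.8421, 785.8436):+.2f} ppm")  # about -1.91 ppm
```

A measurement within ±5 ppm of the theoretical value would satisfy the accuracy criterion described above.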

Tandem mass spectrometry is the underlying instrumental analysis method for MS-based proteomics. In its traditional implementation, detected peptide ions eluting from the nanoLC column are isolated and fragmented, with the m/z values of the fragments being recorded in an MS2 spectrum. There are numerous ways in which isolated peptides can be fragmented, as will be discussed in Sect. 1.2.3. The most-used method, collision-induced dissociation (CID), leaks or “bleeds” a small quantity of an inert gas (He, N2, Ar) into the chamber where the isolated peptide ions reside. The peptide ions collide with the gas and internalize the energy from the collision. Being in the gas phase, the peptide ions cannot re-distribute the energy to solvent molecules. Instead, the energy is eventually transferred to a vibrational mode which cannot sustain the energy available, resulting in bond cleavage. This primarily results in cleavage along the peptide bond of the peptide backbone, although one also frequently sees the loss of water, ammonia, carbon monoxide or labile post-translational modifications [24]. The predominant fragments are named accordingly: b-ions are fragments derived from the N-terminus of the peptide, while y-ions are fragments derived from the C-terminus (see Fig. 1.3 in Sect. 1.3). Certain high-energy fragmentation techniques fragment or completely remove the amino acid side chains, and such ions are named d-, v-, and w-ions.
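
As a sketch of how the b- and y-ion series arise from peptide-bond cleavage, the following Python fragment computes singly protonated b/y m/z values from monoisotopic residue masses; the peptide and the reduced mass table are illustrative only:

```python
# Monoisotopic residue masses (Da); subset chosen for illustration.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111}
PROTON, WATER = 1.007276, 18.010565

def by_ions(peptide: str):
    """Singly charged b- and y-ion m/z values from peptide-bond cleavage."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y

b_ions, y_ions = by_ions("PVSAK")
print("b:", [round(m, 3) for m in b_ions])
print("y:", [round(m, 3) for m in y_ions])
```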

Ideally, one would acquire a high-quality MS2 spectrum for each peptide within a complex mixture. Unfortunately this is not the reality, due to two main factors. First, the speed at which an instrument can gather a sufficient population of peptide ions and generate an MS2 spectrum will determine its effectiveness at sequencing all the detected peptide ions in a sample. Since the peptide signal from the LC column is transient, the more time spent scanning m/z fragments from any peptide ion selected for fragmentation, the more signals from other peptides will be missed. Thus instruments that quickly scan and record MS2 spectra are desirable. A second factor is the dynamic range of abundance of the peptides present. The electrospray process can generate only a finite amount of ions per unit time, and when an extremely abundant peptide elutes from the column, less abundant peptides will undergo so-called ion suppression. Instruments with a greater dynamic range or efficiency at selecting low-abundance ions can mitigate these effects; however peptides from the lowest-abundance proteins in a sample remain undetectable unless enrichment or targeted strategies are employed. Thus, instruments with increased sensitivity to low-abundance peptides are desirable. Increased sensitivity is also linked to scan speed, as increased sensitivity means the instrument must spend less time accumulating fragment ions, and can record MS2 spectra more rapidly.

Some recently released instruments combining the desirable qualities of high resolution, high mass accuracy and rapid scanning speed are the Thermo Orbitrap series and the AB Sciex TripleTOF 5600. The Orbitrap mass analyzer allows ions to orbit a central electrode while simultaneously oscillating axially. This axial motion is mass-to-charge dependent. The Orbitrap analyzer collects an image current of all ions present, each with a characteristic axial frequency. Fourier transform of this image current yields the axial frequencies present and thus the m/z values present. In this type of mass analyzer, resolution increases with longer scans [25, 26]. The first commercial Orbitrap instruments, coupled with a linear trapping quadrupole, delivered resolving powers of >100,000 with measured mass accuracy of 2–5 ppm and recording of up to three low-resolution MS2 scans per second [27]. The recent introduction of the Orbitrap Velos raised this to 10 low-resolution MS2 spectra recorded per second [28]. The latest installment of the Orbitrap series, the Orbitrap Elite, employs a more powerful Orbitrap mass analyzer [29], providing 2–3 fold higher resolution of up to 240,000, and an improved Fourier transform algorithm, delivering a further 2.3-fold gain in resolution. No publications using this instrument exist at the time of writing; however, a recent publication describes a related instrument. The Q Exactive, employing the same Orbitrap as the Elite but with a detectorless trapping quadrupole, requires that all MS1 and MS2 scans be performed in the Orbitrap, thus giving high mass accuracy in all mass spectra and allowing stricter filtering criteria when performing database searches for assignment of peptides to MS2 spectra. This instrument also records 10 high mass accuracy MS2 spectra in a ∼1 s cycle that includes an initial MS1 scan [30]. When coupled to an ultrahigh pressure LC system delivering a 4-h gradient, the Q Exactive achieves 92% coverage of the yeast proteome.

The AB Sciex TripleTOF 5600 is in fact a quadrupole time-of-flight (Q-TOF) configuration. Relative to other Q-TOF instruments, however, the 5600 has improved ion sampling, rapid pulsing of ions towards the TOF and high TOF acceleration voltages, all of which allow up to 100 MS2 spectra to be recorded per second [31]. One of the first publications using this instrument in a proteomics setting determined that 20 MS2 scans per 1.3 s cycle gave the most peptide assignments to acquired MS2 spectra. This instrument delivered a resolution of 40,000 and, with internal calibration, a measured mass accuracy of 2 ppm. The extremely fast scanning of this instrument is credited for the threefold increase in peptide identifications over an early-model Orbitrap instrument.

1.2.2 Improved Electrospray Ion Transfer Efficiency

Maximizing the capture of peptide ions generated via ESI increases the mass spectrometer’s sensitivity. The ESI process generates a divergent ion beam which is collected by a conductance-limited aperture, typically in the form of a skimmer. This configuration captures only a fraction of the ions generated by ESI. The efficiency of ion transfer from an ESI source to the detector has been estimated at <0.1%. To address this bottleneck, Smith and colleagues have produced many refinements to the long-known stacked ring ion guide [32], or ion funnel (Fig. 1.2), yielding successively improved ion transmission while minimizing the m/z dependency of ion transmission [33–36]. The ion funnel consists of a series of evenly or progressively further-spaced ring electrodes with successively decreasing inner diameters to help focus the divergent ion beam. Radio frequency [32] or static [37] electric fields are used to drive ions through the device, sometimes with a DC field superimposed [36]. The ion funnel has achieved collection efficiencies of 50–60% across a typical proteomics m/z range of 200–2,000. However, this interface is still not efficiently coupled to a mass spectrometer due to an increased concentration of charged droplets whose repulsion causes losses during transmission [38].

Fig. 1.2

Ion funnel schematic (Adapted with permission [39], Copyright 2008 American Chemical Society)

Ion funnels have become widely adopted in commercial mass spectrometers; however, the issue of ion transmission remains a barrier to efficient use of all ions generated by ESI. Another factor limiting ion transmission is the pressure gradient between the ESI source and the mass spectrometer. ESI is usually operated at ambient pressure, while the mass spectrometer is operated under high vacuum. This pressure gradient makes efficient ion capture difficult. Operating the electrospray within the low-vacuum region of the MS has been used to mitigate this problem and improves ion signal by approximately an order of magnitude [40, 41], giving an estimated 50% ion transmission efficiency [42]. This technology has been termed subambient pressure ionization with nanoelectrospray, or SPIN. Potential, albeit minor, hurdles to the widespread adoption of this technique are its compatibility with typical nano-LC flow rates [41] and the robustness of an interface where users are required to introduce a nano-LC-ESI column into the first vacuum stage of a mass spectrometer. However, more efficient use of ions produced by ESI clearly pays dividends in improving instrument sensitivity and will likely continue to see innovation until a majority of ions can be routinely captured in commercial-grade mass spectrometers.

1.2.3 New Fragmentation Methods

Tandem mass spectrometry for amino acid sequence elucidation relies, in part, on the ability to efficiently fragment peptide ions. CID is by far the most used fragmentation method due to its simplicity, ease of implementation, and ability to fragment all peptides at least moderately well, in spite of the wide chemical diversity of a typical proteomic sample. As CID is well-suited to other classes of molecules besides peptides, it is present in virtually all commercially available tandem mass spectrometers [43]. CID is sometimes referred to as collisionally-activated dissociation (CAD), particularly when applied to the beam-type version of this fragmentation. This distinction causes one to view CID as a method limited to resonant excitation in an ion trap. Adding to the confusion, one instrument vendor refers to their CAD cell as Higher energy Collisional Dissociation (HCD) [44]. These distinctions are not merely semantic, however. Aside from the differences in hardware required to perform them, resonant excitation CID is slower (∼30 ms vs. <1 ms), produces different fragment ion intensities than beam-type CAD, and suffers from the so-called “one-third rule”: under the ion activation conditions necessary for sufficient fragmentation in an ion trap, fragment ions with m/z ≤ 0.3 times that of the precursor ion are lost during activation [45]. This loss of low-mass b/y ions, as well as helpful immonium ions, hinders the interpretation of the resulting tandem mass spectrum. In isotope-tagging experiments for relative quantification between samples, the low-mass region of the MS2 spectra contains the reporter fragment ions crucial to obtaining quantitative information [46, 47]. While a modified version of resonant excitation in an ion trap, called PQD, can be implemented to preserve the low-mass ions [48], the low fragmentation efficiency of PQD has invited comparison between PQD and HCD [49]. Instrumental improvements in the HCD cell [50] now make HCD a very attractive method for the analysis of low-mass reporter ions in quantitative mass spectrometry.

The use of HCD fragmentation has also been examined for studies of phosphopeptides. While its use resulted in more phosphopeptide and phosphosite identifications than CID [51], this improvement seems attributable largely to the high mass accuracy scans mandated following HCD, as opposed to the low-resolution, low mass accuracy scans typically used following CID, rather than to any inherent improvement in fragmentation pattern or ion collection.

Electron capture dissociation (ECD) [52, 53] and electron transfer dissociation (ETD) [54–56] are two related fragmentation methods, used in Ion Cyclotron Resonance (ICR) and ion trap mass spectrometers, respectively. Both methods involve an ion/ion interaction between a multiply protonated peptide cation and either a low-energy electron in ECD or an electron-donating radical anion in ETD. The charge-reduced peptide cation dissociates before any energy randomization can occur. This is especially important for peptides carrying PTMs such as phosphorylation or glycosylation [57]. In CID or CAD, the labile covalent bonds between these modifications and the peptide are usually preferentially fragmented, limiting fragmentation across the peptide backbone and resulting in less informative MS2 spectra for peptide sequence assignment. ECD and ETD meanwhile provide richer MS2 spectra from many PTM-carrying peptides, since much of the fragmentation still occurs along the peptide backbone. This leaves the modified amino acid residues intact and also provides a relatively full complement of sequence-rich fragments, enabling more effective sequence assignment and increased confidence in the site of modification. This property has made ECD/ETD especially useful in studies of phosphoproteins and glycoproteins [58–61].

Photodissociation methods can also be used to obtain peptide sequence information. Two spectral regimes are commonly used for this purpose: infrared and (vacuum) ultraviolet. Infrared multiphoton dissociation (IRMPD) typically uses a CO2 laser emitting tens of watts at a wavelength of 10.6 μm. This wavelength is efficiently absorbed by phosphopeptides, and thus IRMPD has been investigated for its utility in analyzing this important post-translational modification [62, 63]. MS2 spectra following IRMPD do not suffer from the one-third rule [64]; however, the fragmentation typically takes twice as long as resonant excitation CID. IRMPD produces b/y ions, but also yields more internal fragment ions than CID [65].

UV photodissociation typically uses excimer lasers emitting at 157 or 193 nm as the light source [66]. As air absorbs these wavelengths efficiently, the 157 nm light source especially must be placed in the vacuum region of the mass spectrometer, complicating the instrumental requirements. Single-photon UV absorption is sufficient to induce dissociation and, in contrast to IRMPD, irradiation times on the order of μs or ns suffice. While both 157 and 193 nm light target the peptide backbone, UV photodissociation produces a range of fragments in addition to b/y ions, such as a, d, x, v and w ions. The presence of d, v and w fragment ions is evidence of a high-energy fragmentation method; not surprising given that the energy of a single UV photon is approximately double that of a peptide bond [67]. While these fragments can be useful, for instance, in differentiating between leucine and isoleucine, most commercially available peptide identification programs are not optimized for, or capable of, analyzing these ions.

While there currently exists an impressive, if not overwhelming, array of dissociation methods, none can meet all the requirements of every conceivable experiment. Until such a method exists, there remains room for improvement to those currently used, and the development of entirely novel ones.

1.2.4 Data-Independent MS2 Analysis

Most tandem MS experiments are performed in a data-dependent manner: the collection of peptide ions entering the mass spectrometer is first recorded in the MS1 spectrum, and these ions (also called precursor ions) are serially selected for fragmentation and MS2 spectrum acquisition [68]. It is well-established, however, that this method does not provide complete selection of all peptides in complex samples. An alternative is to perform fragmentation at all peptide ion m/z values, regardless of which ions can be detected in an MS1 spectrum. This alternative is embodied by two different approaches: with and without isolation of precursor ions within a defined m/z window.

In data-dependent MS2 spectra acquisition, all ions within a defined, relatively narrow m/z window bracketing a precursor of interest are isolated for fragmentation. It is, however, possible to omit precursor m/z isolation and effectively fragment simultaneously all ions present across the entire m/z range scanned when acquiring MS1 spectra. When precursor isolation is omitted, a single LC-MS run could in theory detect the entire proteome. In practice, the usual list of mass spectrometer capabilities is desired: high mass accuracy is tremendously beneficial in assigning fragment ions to precursor ions [69, 70]. High scan speed and MS2 spectra acquisition rate are beneficial in assigning fragment ions to a precursor based on chromatographic retention time [71, 72]. The dynamic range of the mass spectrometer is also important in achieving deep sequencing, due to the occurrence of co-eluting peaks [73]. One advantage of this approach is that it can be performed on relatively simple instrumentation: only a collision cell (or other means of achieving dissociation [63]) and a single-stage mass analyzer are required. One major challenge in this type of experiment is the data analysis. Knowledge of which precursor ion masses give rise to the observed fragments is necessary for assigning peptide sequences to MS2 spectra using sequence database searching software. When simultaneously fragmenting multiple peptide ions across a large m/z range, knowledge of which precursor ion belongs to which fragments is lost. While the precursor mass belonging to a set of fragments can be inferred by relating its retention time in an MS1 scan to that of the fragment ions in an MS2 scan [71], this is not a trivial process [72]. As such, data-independent MS2 in the absence of precursor isolation still struggles with very complex samples, but this approach seems to be re-evaluated each time a breakthrough in hardware performance is made.
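
One way to picture the precursor–fragment assignment problem is chromatographic profile correlation: a fragment whose extracted elution profile tracks that of a candidate precursor likely derives from it. The toy sketch below uses a simple Pearson criterion over made-up profiles; production implementations such as those cited above [71, 72] are far more involved:

```python
import numpy as np

def profile_correlation(precursor_xic, fragment_xic):
    """Pearson correlation between two co-acquired elution profiles."""
    return float(np.corrcoef(precursor_xic, fragment_xic)[0, 1])

# Toy elution profiles over eight consecutive scan cycles (invented intensities).
precursor = np.array([0, 2, 10, 40, 35, 12, 3, 0], dtype=float)
frag_a = np.array([0, 1, 6, 22, 19, 7, 1, 0], dtype=float)   # co-elutes
frag_b = np.array([9, 8, 7, 2, 1, 0, 0, 0], dtype=float)     # does not

for name, frag in [("frag_a", frag_a), ("frag_b", frag_b)]:
    print(name, round(profile_correlation(precursor, frag), 3))
```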

Recently, another data-independent acquisition approach was investigated by rapid isolation and fragmentation of peptide ions within narrow (2.5 m/z) precursor isolation windows, spanning the entire m/z range covered by peptide ions (∼400–1,400). These narrow m/z “bins” mitigated the need for MS1 scans, while still providing a tight mass range of potential precursor m/z that could be connected to each MS2 spectra for sequence assignment. For thorough analysis of a typical, complex protein digest, this approach required over 4 days of mass spectrometry instrument time, but required no sample pre-fractionation [74]. Wider isolation widths have been tested, but the resulting tandem mass spectra are likely to contain more than a single peptide species, resulting in complicated database searches [75]. The use of narrow isolation widths demonstrated the ability for a highly automated method to achieve greater proteome coverage and a wider dynamic range than a data-dependent method. As with experiments that do not use precursor isolation, such studies using narrow isolation widths benefit from instrumental improvements such as high mass accuracy and resolution [76]. It is somewhat surprising how few publications exist on this topic, as it seems well-suited to those experimenters not well versed in multidimensional peptide fractionations who might be attracted to a highly automated method. At this time it is difficult to predict whether the data-independent approach will flourish or flounder, in spite of its demonstrated potential.
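
A minimal sketch of how such a narrow-window acquisition scheme could be enumerated; the bounds follow the 2.5 m/z bins over ∼400–1,400 described above, though real methods may overlap windows or vary their widths:

```python
def isolation_windows(start=400.0, stop=1400.0, width=2.5):
    """Contiguous precursor isolation windows spanning the peptide m/z range."""
    windows, low = [], start
    while low < stop:
        windows.append((low, min(low + width, stop)))
        low += width
    return windows

windows = isolation_windows()
print(len(windows), "windows; first:", windows[0], "last:", windows[-1])
```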

1.2.5 Gas-Phase Fractionation and Ion Mobility Separations

Since the complexity of a typical proteomics sample can easily exceed the capacity of an LC-MS system to resolve and detect all peptides present, most fractionation schemes [9–11] occur upstream of the mass spectrometer and are designed to simplify the mixtures introduced into the mass spectrometer to achieve better sensitivity. However, peptide fractionation usually requires considerable manual labor and sample handling.

In contrast to upstream fractionation, a fractionation method has been devised wherein repeated injections of the same unfractionated sample are introduced to the LC-MS, but for each injection a different “fraction” of the standard m/z range is analyzed (e.g. 400–575, 560–740, 730–910 and 900–1,795). This allows the instrument to focus on a smaller m/z range to achieve the most comprehensive detection and fragmentation of peptide ions in that range [77–79]. Since the instrument analyzes or ignores certain portions of the ionized m/z range, this method has been termed “gas-phase fractionation”. For a yeast cell lysate, the analysis of three gas-phase fractions was compared to triplicate analyses of the entire mass range and found to increase the number of identifications by 30% [80]. A further refinement of this method used in silico calculations to determine the optimal m/z bins which would yield equal numbers of theoretical tryptic fragments across the number of bins selected [81]. The authors studied three organisms of differing complexity and found that regardless of the biological source, roughly half the tryptic peptides reside below m/z 685, with decreasing ion density as m/z increases. Thus gas-phase fractionation certainly has the power to increase proteomic coverage, but at the cost of performing multiple LC-MS runs. Unlike upstream peptide fractionation methods, gas-phase fractionation does this in an entirely automated fashion, reducing labor and sample handling. However, this method might not be suitable for the analysis of very small samples with low protein amounts where multiple LC-MS analyses are not possible.
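
Conceptually, the in silico optimization just described amounts to choosing bin edges that equalize peptide counts rather than m/z width. A hypothetical sketch using quantiles over a skewed, tryptic-like m/z distribution (the simulated distribution is ours, not from the cited work):

```python
import numpy as np

def equal_density_bins(peptide_mz, n_bins=4):
    """m/z bin edges giving roughly equal numbers of peptides per bin."""
    return np.quantile(peptide_mz, np.linspace(0, 1, n_bins + 1))

# Simulated m/z values loosely resembling a tryptic digest's skewed density.
rng = np.random.default_rng(0)
mz = rng.gamma(shape=4.0, scale=150.0, size=100_000) + 300.0
print(np.round(equal_density_bins(mz, 4), 1))  # narrower bins at low m/z
```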

Ion mobility spectrometry (IMS) is a gas-phase method for electrophoretically separating ions in the presence of a buffer gas. Ions are separated by their mass, charge and mobility, the latter being inversely related to their collisional cross section [82]. IMS devices are frequently coupled to a mass spectrometer (using ion funnels), creating a hyphenated method, IMS-MS. For the purposes of this section it will be assumed that all IMS separations are coupled to a mass spectrometer. The time frame of a typical IMS separation is ideally suited to its incorporation in a multidimensional fractionation scheme in proteomics: the peak widths for LC, IMS and TOF-MS are on the order of seconds, ms and μs, respectively. This allows each subsequent method to acquire tens of measurements of the preceding separation, the minimum required for adequate profiling of a peak [83].

Three versions of IMS are used: linear drift tubes, traveling wave ion guides, and field-asymmetric IMS (FAIMS) [69]. Linear drift tubes and traveling wave ion guides both resemble a stacked ring ion guide (see Sect. 1.2.2), though they differ in the way electric fields are applied to propel ions through the device. These differences affect the separation mechanism. The resolution of linear drift tubes and traveling wave ion guides [84] is typically the greatest, at 100–150; however, similar values have recently been reported with FAIMS [85, 86]. Also, FAIMS typically separates isomers and isobars better than linear drift tubes and hence has been the most widely implemented in proteomics experiments [85–87]. To date, the most successful configuration for FAIMS is the use of parallel plates separated by ∼2 mm [88]. Under high electric fields, the absolute mobility of an ion deviates from its value at low fields. This difference is exploited in FAIMS by applying an asymmetric radio frequency potential between the two plates. As this potential ejects all ions radially from the device, a DC compensation voltage is required to transmit any ions. This compensation voltage is the discriminating variable in a FAIMS separation [87]. Both Thermo Scientific and AB Sciex have commercially available FAIMS devices which can be added to their mass spectrometers, boasting claims of improved selectivity and signal-to-noise ratios. Shvartsburg, Smith and co-workers have made great improvements in the instrumental design of FAIMS devices, improving resolving power [87, 88] and resolving phosphopeptide isomers which differ only in the site of phosphorylation [89]. Waters Corporation has investigated and commercialized a traveling wave ion guide with its mass spectrometers. As the name implies, a DC voltage is passed along the successive ion guide rings, propelling the ions through the device while an rf field is generated to maintain the ions’ radial position. By selecting the amplitude and velocity of the DC wave, ions can be separated by mobility or simply transmitted through the device [90, 91].

Ion mobility separations can add a second dimension of online fractionation to an LC-MS analysis, which should greatly simplify the mixture of ions arriving at the mass spectrometer with no increase in analysis time. The resolving power is sufficient to separate different components in a mixture; however, a single peptide sequence may have multiple conformations, each with a different IMS mobility, which de-focuses the ion packet generated by LC-MS. Also, it is not clear whether current computational methods can analyze IMS separations quickly enough to make on-the-fly decisions, as is currently done in data-dependent LC-MS experiments. Nonetheless, it seems probable that these hurdles can be overcome and that IMS separations will greatly increase the power of proteomics experiments.

1.2.6 Targeted MS

As the collection of known, MS-observable, proteolytically-derived peptides becomes saturated, some researchers are turning away from data-dependent MS analyses. For a known sample type (e.g. identity of the organism, biological state, sample preparation parameters), the observable peptides emanating from its proteome can be predicted and have likely already been observed in other experiments. Thus, generating a comprehensive list of such so-called “proteotypic” peptides should provide a basis for performing targeted MS experiments in a hypothesis-driven manner [92]. Such methods can then be used, for example, to validate potential biomarkers generated from an initial screening experiment [93] or to follow the proteins in a metabolic pathway following some perturbation [94]. This approach has been greatly advanced by the groups of Aebersold and Carr, who have developed software to predict the most detectable peptides in a mixture [95, 96], catalogued all experimentally observed peptides [97], demonstrated single-copy-per-cell sensitivity [94], and are in the process of synthesizing a complete proteotypic peptide library of human serum [98].

A powerful MS method for quantifying several peptides simultaneously is termed selected reaction monitoring (SRM), or sometimes multiple reaction monitoring (MRM). This type of experiment is performed with a triple-quadrupole instrument and is renowned for its selectivity and sensitivity. As ions are electrosprayed into the MS, the first quadrupole transmits a peptide ion at a user-specified m/z value. This ion is then fragmented in the second quadrupole, which is not mass-selective but merely a fragmentation cell. The third quadrupole is then set to transmit the m/z of an expected fragment ion from the precursor peptide ion. This process is repeated, usually for at least three fragments per peptide ion and two proteotypic peptides per protein of interest. Modern mass spectrometers can achieve reliable quantification by dwelling on such a peptide/fragment m/z pair (the “reaction” in SRM, also called a transition) for 10 ms or less. The duty cycle, and thus the sensitivity, is inversely related to the number of transitions being monitored; however, when the retention time of a peptide is known, the instrument can be scheduled to monitor distinct peptide ions and their transitions at different times. In this way, Kiyonami et al. have quantified 6,000 transitions, relating to 757 peptides, in a single LC-MS analysis [99]. They note that this can be extended to 10,000 transitions, targeting 1,000 peptides. Addona et al. [100] have shown that, when using isotopically-labeled standards, this method is very reproducible within and across eight laboratories using two instrument platforms. Many groups believe that SRM-based targeted proteomics will be the basis for future biomarker validation [101–104].
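
The scheduling logic can be sketched as follows: monitor only those transitions whose expected retention time is near the current chromatographic time. All peptide names, m/z values and retention times below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    peptide: str
    precursor_mz: float   # Q1 setting
    fragment_mz: float    # Q3 setting
    rt_min: float         # expected retention time (minutes)

def active_transitions(transitions, current_rt, window=2.0):
    """Transitions monitored now: within +/- window minutes of expected RT."""
    return [t for t in transitions if abs(t.rt_min - current_rt) <= window]

tlist = [Transition("ELVISK", 382.7, 553.3, 21.4),
         Transition("ELVISK", 382.7, 666.4, 21.4),
         Transition("LIVERPK", 413.8, 498.3, 35.9)]
print([t.fragment_mz for t in active_transitions(tlist, current_rt=21.0)])
```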

An important aspect of such large-scale hypothesis-driven efforts is the software. The identification of proteotypic peptides and their SRM transitions can be very time-consuming if performed manually. A variety of software products exist from the instrument manufacturers and from academic groups to assist in the design of SRM experiments [105–107]. The most powerful and popular tool has come from the MacCoss laboratory. Their open-source platform, Skyline, can guide SRM experiments by optimizing collision energy and fragment ion selection, performing quantification, predicting peptide retention time and a host of other functions, for data acquired from the major instrument manufacturers [108–110]. Continued refinements to such software packages will greatly automate and thus expedite the process of developing and optimizing SRM assays capable of quantifying hundreds to thousands of peptides in a single MS analysis. These advancements are transforming MS-based proteomics from just a large-scale discovery technology to a high-throughput assay for monitoring proteins of interest in hypothesis-driven studies.

1.3 New Peptide Identification Methods

1.3.1 Principles of MS2 Fragmentation

Tandem mass spectrometry-based proteomics experiments rely on the same principle as Edman degradation, a long-standing chemical technique for peptide sequencing [111]. In Edman degradation, stepwise degradation from the peptide’s N-terminus followed by chromatographic analysis of the released derivatives determines the amino acid sequence. The fragmentation that occurs during the MS2 stage mimics Edman degradation because MS2 dissociation randomly breaks the backbone between amino acid residues. This results in two, rarely more, fragment ions, one containing the N-terminus and one the C-terminus. The m/z values of fragment ions are recorded in the MS2 spectrum for every selected precursor peptide ion. However, individual fragment peaks are not informative in isolation; as in Edman degradation, it is their m/z differences that matter. As shown in Fig. 1.3, the m/z differences between these peaks determine both the amino acid residue identities and their positions, thus identifying a peptide.

Fig. 1.3

Example peptide MS2 spectrum
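
To make the ladder-reading idea concrete, the sketch below infers residues from the m/z differences between consecutive peaks of a hypothetical y-ion series; real spectra are far noisier and the series is rarely complete:

```python
# Monoisotopic residue masses (Da) for the residues in this toy example.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
           "V": 99.06841, "L": 113.08406, "K": 128.09496}

def read_ladder(peaks, tol=0.01):
    """Infer residues from m/z differences between consecutive ladder peaks."""
    seq = []
    for lo, hi in zip(peaks, peaks[1:]):
        delta = hi - lo
        match = [aa for aa, m in RESIDUE.items() if abs(m - delta) <= tol]
        seq.append(match[0] if match else "?")
    return "".join(seq)

# Toy y-ion ladder (invented m/z values).
print(read_ladder([375.21, 462.24, 533.28, 632.35]))  # -> "SAV"
```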

These two fragment ions have predictable structures because, as shown in Fig. 1.4, fragmentation can occur at only three positions along an ion’s backbone. Therefore, a fragment ion will resemble one of six ion structures. The standard nomenclature for these fragment ions identifies both the point of fragmentation and which terminus retains the charge: a, b and c ions are N-terminal fragments, while x, y and z ions are C-terminal fragments.

Fig. 1.4

Peptide fragmentation locations and the resulting ions

Although the exact point of fragmentation depends on many factors, the primary factor is the type of dissociation applied. CID and HCD produce primarily b and y ions, with a few a ions sprinkled in, while ETD produces primarily c and z ions. Their resulting fragmentation patterns differ enough to impact the programs interpreting mass spectra.

1.3.2 Interpretation of MS2 Spectra

Today’s mass spectrometers offer unparalleled mass accuracy and efficiency, with a single experiment generating hundreds of thousands of high-resolution MS1 and MS2 spectra. Coupled with the increasing use of new dissociation techniques and chromatography methods, mass spectrometers now generate an overwhelming amount of spectral data with differing fragmentation patterns and retention time profiles. Unfortunately, the widely used software packages for interpretation of mass spectra, that is, for peptide identification, protein inference and validation, were not designed to process this vast amount of data, and they were tuned to process CID-derived data. Because these tools for interpreting mass spectra have failed to keep pace with advances in instrument technology, they yield suboptimal proteome characterization.

Interpretation of mass spectra is a multistep process. The data must first be preprocessed to remove noise and identify valid peaks and features, subjects not reviewed here; several good reviews exist in the literature [112–115]. After preprocessing the sample data, a series of phases culminates in a list of peptides and/or proteins that are confidently deemed present in the sample. These phases are: peptide identification, protein inference, and validation. In the following sections we highlight their main challenges and solutions, and offer an outlook on their future.

1.3.2.1 Peptide Identification

The first phase, peptide identification, assigns an amino acid sequence to a spectrum. This assignment is called a peptide spectrum match, or PSM. Peptide identification programs have evolved over time, but strategies for assigning PSMs fall into one of four categories: database search, spectral library search, de novo sequencing, and hybrids thereof.

1.3.2.1.1 Database Search

During the early days of proteomics experiments, peptide identification was completed via manual de novo sequencing, a tedious process carried out by researchers without the aid of a computer or a database [116]. However, proteomics experiments soon became high-throughput, and the amount of data they generated outpaced researchers’ ability to manually inspect each spectrum. This drove the invention of alternate means of identifying peptides, mainly database search programs.

Today, researchers avail themselves of numerous software packages that implement database search programs (see Table 1.1). The first widely used database search programs from the 1990s, SEQUEST [7] and Mascot [8], remain in common use. Although specific implementations of database search programs differ, they share a common underlying principle introduced by SEQUEST: they compare the observed MS2 spectra to theoretical spectra derived from in silico enzymatic digestion of a FASTA database. They also share common challenges. One challenge is how to efficiently search the large amount of data available in FASTA databases. Searching all possible peptides from a FASTA database and all of their potential PTMs is prohibitively time-consuming. Even with the use of multiple processors, sequence assignment, including possible PTMs, to the hundreds of thousands of MS2 spectra produced by modern instruments can take days or even weeks. Unfortunately, limiting the peptides to those with expected enzyme cleavage sites (e.g., lysine and arginine for trypsin cleavage) and limiting the number of PTMs considered does not adequately narrow the search space. To address this issue, most database search software packages can restrict the search space even further by considering only those peptides whose mass falls within a narrow tolerance window around the observed m/z of the precursor peptide ion. A completely different challenge stems from the fact that different dissociation methods produce very different fragmentation patterns. This was not a problem until recently because, prior to the introduction of ETD, the predominant workhorse of proteomics experiments was CID. But with the introduction of ETD and its increasing adoption comes the requirement for database search programs to allow multiple fragmentation patterns for the same peptide. Because each type of experiment has its own optimal settings for the precursor ion mass tolerance window, the number of PTMs considered and the dissociation method used, the researcher must set these parameters.

Table 1.1 Partial list of database search tools
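
To illustrate the two search-space restrictions described above, here is a simplified sketch of in silico tryptic digestion (cleavage C-terminal to K/R, not before P) and a precursor mass tolerance filter; this is a conceptual toy, not the algorithm of any particular search engine:

```python
import re

def tryptic_digest(protein, max_missed=0):
    """Peptides from cleavage C-terminal to K/R (not before P)."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        for j in range(i, min(i + max_missed + 1, len(pieces))):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides

def within_tolerance(peptide_mass, precursor_mass, ppm=10.0):
    """Keep only candidates whose mass matches the observed precursor."""
    return abs(peptide_mass - precursor_mass) / precursor_mass * 1e6 <= ppm

print(tryptic_digest("MKWVTFISLLLLFSSAYSRGVFRR", max_missed=1))
```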

An inconvenient consequence of parameter driven database search is that each different set of parameters produces different results. Therefore, researchers must exercise caution when comparing results between experiments, both within and between laboratories.

Although database search strategies are the predominant choice for peptide identification in shotgun proteomics [123], they do have limitations. First, database search relies on sequencing data for the organism being studied. Thus, if an organism has not yet been sequenced, database searching can only be used to find homologous peptides in other organisms. Second, unexpected yet important PTMs and sequence anomalies will be missed because these variants do not exist in the database. Even though some databases take splice variants into consideration, no production-quality database search engine takes advantage of the annotation available in databases such as Swiss-Prot or UniProtKB. Therefore, many unexpected, but annotated, PTMs and polymorphisms are missed, which leads to incorrect or missed peptide identifications [124]. Third, false positive identifications occur often because database search programs assign a peptide sequence to each and every spectrum, regardless of quality. Fourth, validation assigns high confidence (>95%) to only 10–30% of the spectra [125]. Finally, each database search engine identifies a partially overlapping but different set of peptides. For instance, SEQUEST may identify a set of 100 peptides and Mascot may identify a set of 100 peptides from the same spectra, but perhaps only 60 of them are common to both.

A relatively recent development in shotgun proteomics research combines the results from several different database search engines to identify more peptides with increased confidence. The idea of combining results from multiple sources is not new. Resing et al. described consensus scoring for multiple peptide identifications from different search engines in 2004 [126], and Alves et al. proposed combining and calibrating confidence scores from multiple search engines into a meta-analytic value for each confidence score [127]. However, software automating the integration of separate database search results was developed more recently. For instance, a popular tool allowing researchers to combine results from multiple search engines is Scaffold, developed by Searle et al. [125, 128]. By probabilistically combining results from multiple search engines, including SEQUEST, X!Tandem, OMSSA, InsPecT and Mascot, Scaffold increases sensitivity by a minimum of 20% with each search engine added [129]. As evidenced by the latest publication of a tool combining results from multiple search engines [130], the idea is garnering more attention, and we can expect this trend of new tools incorporating multiple search engines to continue into the foreseeable future.

1.3.2.1.2 Spectral Library Search

Spectral library search strategies are similar to database search strategies, except that the observed MS2 spectra are compared to collections of experimentally generated spectra rather than hypothetical spectra [131]. These strategies outperform database search strategies in terms of error rates, speed and sensitivity. Using spectral libraries reduces the time spent repeatedly identifying the same identifiable peptides by database searching [132], but a peptide can only be identified if it has been previously analyzed by tandem mass spectrometry and its sequence positively identified. A partial list of spectral library search tools is given in Table 1.2.

Table 1.2 Partial list of spectral library search tools
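
The core comparison in a spectral library search is a similarity score between the observed spectrum and each library spectrum, often a normalized dot product. A toy sketch with spectra binned to integer m/z (all values invented for illustration):

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Normalized dot product between two {m/z bin: intensity} spectra."""
    dot = sum(spec_a[k] * spec_b[k] for k in set(spec_a) & set(spec_b))
    norm = math.sqrt(sum(v * v for v in spec_a.values())) \
         * math.sqrt(sum(v * v for v in spec_b.values()))
    return dot / norm if norm else 0.0

observed = {147: 30.0, 263: 55.0, 376: 100.0, 489: 70.0}
library = {147: 28.0, 263: 60.0, 376: 100.0, 490: 65.0}  # one peak shifted
print(round(cosine_similarity(observed, library), 3))
```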

Libraries of experimental spectra are available from many sources and provide a rich source of spectral data. Spectral libraries for many organisms are stored at the National Institute of Standards and Technology (NIST). Although the NIST libraries do not target specific PTMs, specialized libraries for specific modifications are available elsewhere, e.g., PhosphoPep [138] for phosphorylation sites in model organisms and the open source Ub/Ubl spectral library [139] for ubiquitin and ubiquitin-like modifications. In addition, a wealth of spectral data can be downloaded from one of several proteomic data sharing repositories, e.g., PeptideAtlas [140], Pride [141], Peptidome [142], and Tranche (https://trancheproject.org/).

As the amount of publicly available spectral data grows, the hope is that one day spectra for all peptides detectable by MS (at least for well-studied organisms) will be contained and annotated in publicly available spectral reference libraries. However, until these reference libraries are sufficiently complete, spectral library search strategies will continue to be underutilized [143]. In the meantime, spectral reference libraries are a rich resource that could be used for purposes other than identifying peptides. For instance, spectral data could be mined to provide important insight into fragmentation patterns, which could in turn improve database search or de novo sequencing [144], as well as aid the development of SRM methods to target specific peptides [97].

1.3.2.1.3 De Novo Sequencing

The limited ability of database search and spectral library search strategies to identify the unexpected, for example PTMs, polymorphisms and sequence anomalies, drives the need for peptide identification programs that can efficiently handle enormous amounts of data without sacrificing confidence in their results. Though conceptually unchanged, de novo sequencing is again being turned to as an alternative to database search for accurately and confidently identifying peptides. De novo sequencing programs can identify PTMs, polymorphisms and sequence anomalies because they compute directly on spectra to determine the peptide amino acid sequence, a process which does not require searching against FASTA databases [145].

De novo sequencing for proteomics has a long and rich history. Spectra were originally sequenced manually, a process which does not scale well. Therefore, as the amount of data from shotgun proteomics grew, researchers turned to computer science for automated de novo sequencing. In the 1980s, several computational algorithms were introduced that helped [146–150] but proved to be terribly slow because they tended to consider all possible amino acid sequences by brute force. In 1990, computational algorithms became more efficient when Bartels represented a spectrum as a graph [151]. Although this type of graph is called a spectrum graph by the proteomics community, it should not be confused with spectral graph theory, where a graph’s spectrum is defined as the set of eigenvalues of the graph’s adjacency matrix, nor with a general graph of nodes and edges, where a node does not have a position and an edge can connect any two nodes. In this spectrum graph representation, the vertices represent the spectrum’s m/z values and two vertices are linked by an edge if their mass difference is equivalent to the mass of an amino acid. Figure 1.5 shows a theoretical spectrum graph of the spectrum in Fig. 1.3. Formally, Bartels defined the problem as:

Fig. 1.5

MS2 spectrum graph

Given the amino acid masses \( M = \left\{ m_1, \ldots, m_{20} \right\} \) and a spectrum \( S = \left\{ s_1, \ldots, s_c \right\} \), transform it into a spectrum graph \( G\left( V, E \right) \) such that \( V = \left\{ v_1, \ldots, v_c \right\} \) and \( E = \left\{ e_1, \ldots, e_t \right\} \), where each vertex \( v \) represents a single integer m/z value and two vertices \( v_n, v_q \) (\( q \neq n \)) are connected by a directed edge \( e \) if \( \left| v_q - v_n \right| \approx m_i \) for some \( m_i \in M \).
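
A minimal sketch of this construction follows, with a reduced amino acid mass table and an invented peak list; peak scoring, noise handling and charge states are all omitted:

```python
AA_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}

def spectrum_graph(peaks, tol=0.01):
    """Directed edges between peaks whose m/z difference matches a residue."""
    edges = []
    for i, lo in enumerate(peaks):
        for hi in peaks[i + 1:]:
            for aa, mass in AA_MASSES.items():
                if abs((hi - lo) - mass) <= tol:
                    edges.append((lo, hi, aa))
    return edges

peaks = [115.05, 172.07, 243.11, 342.18]
for lo, hi, aa in spectrum_graph(peaks):
    print(f"{lo:.2f} -> {hi:.2f}  ({aa})")  # a path spelling G, A, V
```

A candidate peptide sequence then corresponds to a path through this graph, which is what makes efficient graph algorithms applicable.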

Table 1.3 Partial list of de novo sequencing tools

Although Bartels’ approach is now the de facto basis for most de novo peptide sequencing programs, several unresolved issues limited the spectrum graph’s, and therefore de novo sequencing’s, adoption. First, spectrum graph models were instrument-specific, which required training a new model for each new mass spectrometer; it took until 1997 for Taylor and Johnson to implement this strategy with Lutefisk [152]. Second, the predominantly used CID has a propensity for incomplete fragmentation, which results in multiple disconnected graphs (graph gaps), limiting the effectiveness of spectrum graph algorithms. Finally, the lack of standardized scoring models for spectrum graphs hindered researchers’ ability to compare experimental results. SHERENGA, introduced by Dancik et al. in a landmark 1999 publication [153], addressed each of these limitations using a spectrum graph-based algorithm. Several research groups have since made additional enhancements to these algorithms, most notably dynamic programming [154] and probabilistic models using networks learned over annotated spectra (e.g., PepNovo [155] and NovoHMM [156]).

Although several de novo sequencing software packages implementing spectrum graph algorithms are now available, as shown in Table 1.3, de novo sequencing’s adoption as a viable option for shotgun proteomics experiments has been slow. A primary contributor to this slow adoption was that affordable mass spectrometry instrumentation lacked the ability to produce high resolution spectra with minimal noise, completely fragment selected ions and retain potentially important PTMs. Until recently, these issues could only be overcome by using complementary spectra, MS2 and MS3 [160], ECD and CID [161], or differentially modified pairs [162], on an FTICR mass spectrometer. The FTICR spectrometer can generate spectra with high resolution (>100,000), making it much easier to differentiate valid ions from noise. Furthermore, ECD is complementary to CID, and a more complete fragmentation pattern emerges when their spectra are combined into a single artificial spectrum. Finally, ECD inherently uses lower energy than CID, allowing for retention and subsequent identification of more PTMs. FTICR mass spectrometers’ main drawbacks are that they are extremely expensive and inefficient compared to ion trap mass spectrometers, the main workhorse instruments in MS-based proteomics. Even though FTICR instruments offer 100 times better resolution than an ion trap, each spectrum takes 10 times longer to acquire than on an ion trap instrument.

Recently, via the introduction of the Orbitrap instrument series, the more affordable ion trap mass spectrometers became capable of high resolution (>100,000) and offered ETD, which gives spectra similar to ECD. Taking advantage of these improvements, Datta and Bern expanded on previous pioneering work fusing ECD and CID spectra [163]. In 2009, they introduced Spectrum Fusion, which uses a global graph partitioning approach both to separate b and y ions and to fuse CID and ETD spectra. The heart of Spectrum Fusion is a supervised machine learning algorithm (a tree-augmented naïve Bayes network) trained on confidently identified spectra from a prior database search. The result is a synthetic spectrum with only b ions, which can then be sequenced by a slightly modified spectrum graph de novo algorithm.

1.3.2.1.4 Hybrid Strategies: De Novo & Database/Spectral Library Search

Despite advances in both mass spectrometry instrumentation and software programs, incomplete fragmentation remains an open issue for de novo sequencing strategies. However, when dissociating thousands of ions, they often break along the backbone in enough places that de novo programs can sequence short peptide stretches, typically 3–5 amino acids in length. Again, these ideas are not new. In fact, Mann and Wilm introduced the notion of using short peptide sequences, which they called sequence tags, in 1994 [164], the same year as SEQUEST. However, strategies based on sequence tags did not appear until Tabb et al. published the GutenTag program in 2003 [165]. The innovation of GutenTag is that it constructs a model spectrum of the peaks expected from a given sequence tag, compares the observed spectrum to the model spectrum, and generates a correlation score. Tabb et al. went on to provide an enhanced database search tool, MyriMatch [120], which is tuned to use these short peptide sequences to infer candidate proteins. Hybrid peptide identification strategies using sequence tags are gaining popularity and several hybrid tools are now available (Table 1.4).

Table 1.4 Partial list of hybrid search tools

1.3.2.2 Protein Inference

While peptide identification is a necessary phase in proteome profiling, it is not the last one. Proteins must be inferred from the list of identified peptides. However, the task of assembling peptide identifications to infer the proteins present in a sample, known as the protein inference problem, is far from trivial [169]. The connection between peptides and proteins is lost during enzymatic digestion, because multiple proteins can share peptides. The sources of these shared peptides, also known as degenerate peptides, include both natural and artificial phenomena. Degenerate peptides often arise naturally, especially in eukaryotic organisms, due to homologous sequences or splice variants. To make matters worse, errors and redundancies in the database being searched add even more, albeit artificial, degenerate peptides [170]. Regardless of their source, degenerate peptides limit the ability to differentiate between proteins, resulting in an unsatisfactory level of ambiguity. This drives the need for validation of results.
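One simple, widely discussed way to reason about the inference problem is parsimony: report the smallest set of proteins that explains all identified peptides. The sketch below implements this as a greedy set-cover heuristic over a hypothetical peptide-to-protein map; production tools layer probabilistic scoring and grouping of indistinguishable proteins on top of ideas like this.

```python
# Parsimony-style protein inference sketch: greedily pick the protein
# that explains the most still-unexplained peptides (a set-cover
# heuristic, i.e. Occam's razor applied to shared peptides).
# Illustrative only.

def infer_proteins(peptide_to_proteins):
    # Invert the map to protein -> set of peptides.
    protein_to_peptides = {}
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            protein_to_peptides.setdefault(prot, set()).add(pep)
    unexplained = set(peptide_to_proteins)
    inferred = []
    while unexplained:
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        gain = protein_to_peptides[best] & unexplained
        if not gain:
            break
        inferred.append(best)
        unexplained -= gain
    return inferred

# Example: pep2 is degenerate (shared by A and B), so only A is reported.
print(infer_proteins({'pep1': {'A'}, 'pep2': {'A', 'B'}}))  # ['A']
```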

1.3.2.3 Validation

MS-based proteomics results are inherently prone to inaccuracies. Without careful filtering, results are riddled with false positive identifications at both the peptide identification and protein inference levels. To reduce the number of false positives, several scoring models have been proposed to impart a confidence level on identified peptides and inferred proteins. To date, because no single scoring model dominates, different software packages employ their own. SEQUEST, X!Tandem, Mascot and OMSSA employ variations of a cross correlation (XCorr) score, which measures the similarity, at different offsets, between pre-processed observed spectra and hypothetical spectra generated by in silico digestion. Mascot differs slightly from other XCorr-based scoring models in that it assesses the probability that a peptide-spectrum match is a random event. Other software packages use scoring models based on empirically observed rules (SpectrumMill) or incorporate statistically derived fragmentation frequencies (PHENYX) [122].

Each of the thousands of peptide identifications or protein inferences can be assigned an individual score. However, single-case scores do not take into account that multiple hypotheses are being tested. Therefore, in addition to the single-test statistic, the p-value, its close relative for multiple testing, the E-value, is often used. The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. The E-value is the expected number of times, across multiple tests, that a test statistic as extreme as the one actually observed would be obtained, again assuming the null hypothesis is true. Put more simply, the E-value is the p-value multiplied by the number of tests. To account for multiple hypothesis testing, many controlling procedures have been proposed. The Bonferroni correction controls the Family Wise Error Rate (FWER), the probability of finding at least one false positive. However, the Bonferroni correction has been shown to be too conservative given the thousands of hypothesis tests in a single experiment [171].

Less conservative than the Bonferroni correction is the False Discovery Rate (FDR) controlling procedure introduced by Benjamini and Hochberg [172], who defined FDR as the expected fraction of mistakes among the rejected hypotheses and suggested controlling it in multiple testing. A well-established mechanism for implementing FDR estimation on database search results is to search against a decoy FASTA database of invalid peptide sequences, most often concatenated to the end of the target FASTA database of valid peptide sequences [173]. The premise of this approach is that a spectrum will match invalid sequences in the target and decoy databases with equal probability, and that target and decoy sequences do not overlap. Although decoy databases are intended to be random, in practice they are most often constructed by reversing, shuffling or randomizing the target FASTA database [174].
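In its simplest form, the target-decoy estimate is just a ratio: above a chosen score threshold, the number of decoy matches approximates the number of false target matches. The sketch below implements that arithmetic; variants of the formula exist (e.g., doubling the decoy count over all accepted matches), so this is one common convention rather than the only one.

```python
# Target-decoy FDR sketch for a concatenated search: above a score
# threshold, decoy matches estimate the number of false target matches,
# so FDR ~ decoys / targets. Illustrative; one of several conventions.

def target_decoy_fdr(psms, threshold):
    """psms: list of (score, is_decoy) tuples; higher score = better."""
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```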

With the introduction of FDR as a controlling procedure, publications ensued discussing the proper use of statistical values. Käll et al. argue that a raw p-value threshold is inadequate because the statistical test is performed so many times [175]. It also has the unfortunate property that two different p-value thresholds can result in the same FDR. To address this problem, Storey and Tibshirani [176] proposed the q-value, which, when applied to shotgun proteomics, is defined as the minimum FDR threshold at which a given PSM will be accepted.
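Computing q-values from a target-decoy search is mechanical once the PSMs are sorted best-first: estimate the FDR at every possible threshold, then enforce monotonicity with a running minimum from the worst score upward. The sketch below, with illustrative names, assumes higher scores are better.

```python
# q-value sketch: the q-value of a PSM is the minimum estimated FDR at
# which it would be accepted. Sort best-first, compute decoy/target FDR
# at each rank, then take a reverse cumulative minimum. Illustrative.

def q_values(psms):
    """psms: list of (score, is_decoy); returns q-values, best PSM first."""
    psms = sorted(psms, key=lambda x: x[0], reverse=True)
    fdrs, targets, decoys = [], 0, 0
    for _, is_decoy in psms:
        decoys += is_decoy
        targets += not is_decoy
        fdrs.append(decoys / targets if targets else 1.0)
    qvals, running_min = [], 1.0
    for fdr in reversed(fdrs):          # enforce monotonicity from the bottom up
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    return list(reversed(qvals))
```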

Historically, researchers implemented the FDR controlling procedure with a decoy database by accepting all identifications above a certain threshold [177]. This threshold was usually a combination of scores provided by the database search engine. However, this strategy has problems, including the need for separate thresholds for different types of instruments. To overcome them, early validation tools, e.g., QSCORE [178], were developed that employed simple probability models but focused on results from a single search engine.

Because threshold-based statistical models tend to be instrument specific, researchers turned to machine learning, notably mixture modeling, to build generic models that can process results from multiple instrument types. Mixture modeling fits two distributions, one for correct identifications and one for incorrect identifications, to determine a score threshold. Perhaps the most widely used example of mixture modeling for peptide identification validation is Keller et al.’s PeptideProphet [173]. It uses a discriminant score derived by converting several scores from the database search program into a single score. To apply a two-component mixture model, PeptideProphet creates a histogram of discriminant scores and uses curve fitting to draw the correct and incorrect distributions. Using Bayesian statistics, it computes the probability of an identification being correct given its discriminant score. Similar to PeptideProphet is ProteinProphet, which validates protein inferences. It takes PeptideProphet results as input, derives a mixture model of correct and incorrect protein inferences using an expectation-maximization (EM) routine, and computes the probability that an inferred protein is present in the sample [179]. Because PeptideProphet/ProteinProphet is open source, freely available and integrated into the Trans-Proteomic Pipeline (TPP), it is an attractive option for interpreting mass spectra, as evidenced by its use in a number of prominent laboratories.
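The following sketch shows the mixture-modeling idea in its simplest form: fit two components to the discriminant scores with EM and report each identification’s posterior probability of being correct. For brevity both components here are Gaussian, whereas PeptideProphet’s published model pairs a Gaussian with a gamma distribution, so this illustrates the approach rather than reimplementing that tool.

```python
# Two-component mixture sketch: fit "incorrect" and "correct" Gaussians
# to discriminant scores with EM; the E-step posterior for the correct
# component is the probability an identification is correct. Illustrative.

import math

def em_two_component(scores, iters=100):
    mu = [min(scores), max(scores)]   # initial means: incorrect, correct
    sd = [1.0, 1.0]                   # initial standard deviations
    w = [0.5, 0.5]                    # mixing weights
    norm = lambda x, m, s: (math.exp(-0.5 * ((x - m) / s) ** 2)
                            / (s * math.sqrt(2 * math.pi)))
    post = [0.5] * len(scores)
    for _ in range(iters):
        # E-step: posterior that each score belongs to the correct component.
        for i, x in enumerate(scores):
            p0 = w[0] * norm(x, mu[0], sd[0])
            p1 = w[1] * norm(x, mu[1], sd[1])
            post[i] = p1 / (p0 + p1) if (p0 + p1) > 0 else 0.5
        # M-step: re-estimate weights, means and standard deviations.
        n1 = max(sum(post), 1e-9)
        n0 = max(len(scores) - n1, 1e-9)
        w = [n0 / len(scores), n1 / len(scores)]
        mu = [sum((1 - p) * x for p, x in zip(post, scores)) / n0,
              sum(p * x for p, x in zip(post, scores)) / n1]
        sd = [max(math.sqrt(sum((1 - p) * (x - mu[0]) ** 2
                                for p, x in zip(post, scores)) / n0), 1e-3),
              max(math.sqrt(sum(p * (x - mu[1]) ** 2
                                for p, x in zip(post, scores)) / n1), 1e-3)]
    return post  # posterior probability each identification is correct
```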

Although mixture models can accurately capture correct and incorrect score distributions, they are inherently complex and not easily extensible [180]. Therefore, other scoring models have been proposed. For instance, IDPicker is based on a simple non-parametric Monte Carlo simulation method. IDPicker employs FDR-based aggregation of identifications instead of individual identification probabilities, and it is easily extended to accept scoring metrics from multiple search engines, as long as decoys are provided in the searched database [180].

In a recent departure from the canonical target-decoy approach, Kim et al. propose MS-GF, which uses generating functions and their derivatives without a decoy database [181]. They argue that by using a decoy database, the proteomics community is de facto acknowledging that it has been unable to solve the following Spectrum Matching problem: “Given a spectrum S and a score threshold T for a spectrum-peptide scoring function, find the probability that a random peptide matches the spectrum S with score equal to or larger than T” [181]. This problem assumes certain underlying distributions on which probabilistic calculations can be performed. Ideally, the underlying distributions would be purely theoretical, allowing direct calculation of probabilities and expectation values. However, the sheer number of possible parameters makes modeling the theoretical underlying distribution impractical [182]. Instead, p-values and E-values are calculated using heuristic algorithms operating on empirically derived distributions. In contrast, MS-GF demonstrates that it is possible to compute exactly how many peptides in a database, however large, match a spectrum above a score threshold, thereby solving the Spectrum Matching problem.
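The essence of the generating-function approach can be captured in a small dynamic program over (mass, score) pairs: count every amino acid string reaching the precursor mass, broken down by score, and the fraction at or above the threshold is an exact probability under a uniform random-peptide model. The sketch below uses toy integer masses, an abbreviated alphabet and a simplistic prefix-based score, so it illustrates the counting idea rather than MS-GF’s actual scoring function.

```python
# Generating-function sketch: counts[m][s] = number of amino acid strings
# of total integer mass m and score s, built up one residue at a time.
# The exact fraction of strings at the precursor mass scoring >= T is a
# spectral p-value under a uniform random-peptide model. Toy masses and
# scoring; illustrative only.

from collections import defaultdict

RESIDUE_INT_MASS = {'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99}  # abbreviated

def spectral_probability(prefix_score, precursor_mass, threshold):
    """prefix_score: {integer prefix mass: score gained for peak evidence}."""
    counts = defaultdict(lambda: defaultdict(int))
    counts[0][0] = 1
    for m in range(1, precursor_mass + 1):
        gained = prefix_score.get(m, 0)   # peak evidence at this prefix mass
        for aa_mass in RESIDUE_INT_MASS.values():
            prev = m - aa_mass
            if prev < 0:
                continue
            for s, c in list(counts[prev].items()):
                counts[m][s + gained] += c
    total = sum(counts[precursor_mass].values())
    hits = sum(c for s, c in counts[precursor_mass].items() if s >= threshold)
    return hits / total if total else 0.0
```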

1.3.3 Outlook

Although many difficulties exist in thoroughly characterizing a proteome, no consensus has been reached by the proteomics community on which peptide identification, protein inference and validation strategies should be used. This is largely because shotgun proteomics is relatively immature and more complex compared to other fields such as genomics. Whereas the genomics community can readily compare results from experiments conducted in different laboratories, the proteomics community has difficulty doing so because the reporting of results is not standardized. For instance, some shotgun proteomics researchers report proteins inferred from a single peptide, while others report only proteins inferred from two or more distinct peptides. If a peptide is shared between multiple proteins, some researchers randomly assign the peptide to a protein, while others apply Occam’s razor or other statistical models. This is compounded by the availability of vastly different FASTA databases for a single organism, with differences stemming mainly from their curation processes, or lack thereof, and their sources of deposited sequences.

Reporting standards for shotgun proteomics experiments may lack consensus, but serious effort has been made to rectify this problem. In 2002, the Human Proteome Organization (HUPO) launched the Proteomics Standards Initiative (PSI), whose goal was, and is, to “define community standards for data representation in proteomics to facilitate systematic data capture, comparison, exchange and verification” [183–187]. Although HUPO sets standards for the broader proteomics community, publishing criteria were still lacking for shotgun proteomics results. To address this, about 30 key people in the proteomics community met in Paris to develop a set of standards focused on the publication of shotgun proteomics results. These standards, published in 2006 [188] as the Paris Guidelines and updated in 2009, are slowly being adopted by proteomics journals.

1.4 Label-Free Quantification

The initial application of the MS-based proteomics platform addressed the challenge of cataloging proteins within complex samples. However, biological researchers also need to quantify proteins: proteomes are highly dynamic systems in which protein abundances change through regulated synthesis and degradation, and protein activities are dynamically regulated via the addition or removal of PTMs. Therefore, to be truly useful to researchers trying to understand living systems, MS-based proteomics must be able to quantify abundance and PTM differences between samples.

The initial technology for quantitative MS-based proteomics involved differential labeling with stable isotopes. Isotope labeling methods for quantitative MS-based proteomics have been reviewed in detail [189, 190]. These methods label proteins and/or peptides with stable isotopes (15N, 13C, 18O) through a variety of mechanisms. Stable isotope labeling by amino acids in cell culture (SILAC) labels proteins via metabolic incorporation of stable-isotope-containing amino acids in the cell culture media [191, 192]. Other methods introduce stable isotopes via reactive chemical tags, such as the isotope-coded affinity tag (ICAT) [193] or isobaric peptide tagging (e.g., iTRAQ, TMT) [194, 195] methods. Labeling with 18O is accomplished enzymatically at the C-terminus of peptides within complex mixtures [196]. For all of these methods, distinct protein mixtures are first differentially labeled, one with isotopically normal amino acids or chemical tags and the other with isotopically “heavy” amino acids or chemical tags. Although most labeling methods compare protein abundance between two distinct mixtures, some are capable of multiplexed analysis; iTRAQ labeling, for example, can compare up to eight samples [197]. After labeling, the mixtures are combined, and peptide digests are fractionated and analyzed by MS. Peptide sequences common to both samples, although differentially isotopically labeled, retain the same chemical properties and behave similarly during fractionation. Consequently, differentially labeled peptides are detected simultaneously and their m/z differences resolved in the MS. Peptides are selected for MS2 and identified via subsequent sequence database searching. For identified peptides, relative abundance levels between samples are determined by comparing the mass spectral peak intensities corresponding to the normal or heavy isotope labels.

Although still used prominently, stable isotope labeling has limitations. One is cost: stable isotope labeled amino acids and chemical tags are costly to synthesize, and their purchase can run from hundreds to thousands of dollars, depending on the labeling method. Another is applicability to only certain biological sample types. SILAC, arguably the most accurate stable isotope labeling method, is applicable only to experiments using cell culture models, although extremely expensive whole-organism labeling studies in mice and worms have been described [198]. For human and other animal studies, chemical tagging methods, such as iTRAQ or TMT, must be used for stable isotope labeling. Unfortunately, the accuracy of iTRAQ and TMT, whose relative abundance measurements are based on MS2 fragmentation of labeled peptides, is decreased by the simultaneous fragmentation of multiple peptides in shotgun proteomics [199].

Responding to these limitations, label-free technology has emerged which obviates the need for stable isotope labeling for quantitative proteomics. Two methods underpin the label-free MS-based quantitative proteomics technology: spectral counting and intensity-based measurements. Figure 1.6 details these two methods.

Fig. 1.6 Label-free quantification methods

1.4.1 Spectral Counting Quantification

Spectral counting is based on the core instrumental method used in MS-based shotgun proteomics. Peptides separated via LC are detected and selected for CID fragmentation using a data-dependent routine, and the fragmentation spectra are recorded as MS2 spectra. Peptides are identified by assigning a sequence to each MS2 spectrum using databases of known protein sequences and a variety of software programs, as described in Sect. 1.2. Protein identities in the starting mixture are inferred from the identification of peptides that are part of their amino acid sequence. Quantification via spectral counting is based on the observation that the number of peptides identified from MS2 spectra is proportional to the abundance of the protein in the starting mixture: more abundant proteins yield more identified peptides, and less abundant proteins yield fewer. Protein quantification is achieved by simply counting the number of MS2 spectra assigned to peptides within a given protein, without taking the peptide MS signal intensity into consideration. Because quantification is based on peptides assigned to MS2 spectra, spectral counting benefits from MS instruments with higher mass accuracy and sensitivity, which increase the number of high-confidence peptide identifications [20].

Early on, spectral counting was done simply by summing the number of peptide identifications corresponding to each inferred protein. However, as the method grew in popularity, more sophisticated quantification approaches based on spectral counting emerged. Several extensive reviews of spectral counting have recently appeared [200]. Here we discuss the most commonly used approaches to spectral counting quantification and some representative studies that have used this method.

Spectral counting must take a protein’s length into account, because a longer protein, when enzymatically digested, produces more peptides for the MS to detect than a shorter protein. Without correction, protein quantification by spectral counting would be biased towards longer proteins. As a consequence, a length-aware approach was developed [201] that provides a normalized spectral abundance factor (NSAF) for each identified protein: the protein’s spectral count is divided by its length, and this factor is then divided by the sum of such factors for all identified proteins, yielding an estimate of the protein’s relative abundance within the mixture.
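The NSAF arithmetic is straightforward, as the sketch below shows: divide each protein’s spectral count by its length, then normalize by the summed factors so the values sum to one and are comparable across proteins and runs. Identifiers are illustrative.

```python
# NSAF sketch: SAF = spectral count / protein length; NSAF = SAF
# normalized to the sum of SAFs over all identified proteins.

def nsaf(spectral_counts, lengths):
    """spectral_counts, lengths: dicts keyed by protein identifier."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: v / total for p, v in saf.items()}

# Example: with equal counts, the longer protein gets the smaller NSAF.
print(nsaf({'A': 10, 'B': 10}, {'A': 200, 'B': 400}))
```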

An alternative approach to NSAF is the protein abundance index (PAI) [202], later improved to the exponentially modified PAI, or emPAI [203]. This approach divides the number of peptides actually identified from a protein by the estimated total number of peptides expected to be identified for that protein, with the expected peptides estimated from the protein’s sequence and the sizes of the peptides produced by enzymatic digestion. The relative molar amount of any given protein within a sample can then be calculated by dividing its emPAI value by the sum of all emPAI values within the mixture. The emPAI approach has been deployed in a freely available application, emPAI Calc, that accepts data from a variety of sequence database searching programs [200].
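A minimal emPAI calculation, following the published formula (PAI is observed over observable peptides, and emPAI = 10^PAI - 1), might look like the sketch below. Real implementations restrict “observable” peptides to the instrument’s scan range, which this illustration omits; identifiers are illustrative.

```python
# emPAI sketch: PAI = observed / observable peptides for a protein;
# emPAI = 10**PAI - 1; molar fraction = emPAI / sum of all emPAI values.

def empai(observed, observable):
    return 10 ** (observed / observable) - 1

def molar_fractions(proteins):
    """proteins: dict of protein -> (observed, observable) peptide counts."""
    values = {p: empai(obs, total) for p, (obs, total) in proteins.items()}
    grand_total = sum(values.values())
    return {p: v / grand_total for p, v in values.items()}
```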

Another approach, Absolute Protein Expression (APEX), corrects for physicochemical variations between peptide sequences that may affect their identification in the MS and bias spectral counting results. APEX uses a correction factor based on properties such as amino acid content and peptide length [204] to assess the probability that any given peptide will be detected by MS and subsequently identified from MS2 spectra. This correction is applied to the spectral counts of each identified protein to provide a more accurate measurement of its abundance. APEX has been released as an open source application [205].

Spectral counting has been widely applied, and its applications are reviewed in detail elsewhere [200, 206]. Software plays a key role in automating spectral counting quantification; Table 1.5 summarizes the most popular open-source options. One particularly powerful application uses spectral counting and NSAF values to quantify the relative abundance of proteins within functional complexes, enabling estimation of the relative stoichiometry of protein complex members [207] as well as modeling of protein-protein interaction networks [201]. An interesting application of the emPAI approach identified and quantified relative abundance levels of over 100 proteins in the chicken egg white proteome [208]. APEX was recently used to characterize proteome abundance differences between mutant strains of the thermophilic anaerobic bacterium Clostridium thermocellum, an organism with promise for biofuel production [209].

Table 1.5 Summary of open-source software for label-free quantification

1.4.2 Intensity-Based Quantification

An alternative to spectral counting is intensity-based measurement of peptide abundance. During a nanoLC-MS analysis, the mass-to-charge ratio (m/z), retention time and signal intensity are continuously recorded for each detected peptide, and this information can be used to reconstruct a chromatographic peak for each peptide. This quantification method estimates the area under the curve (AUC) of the chromatographic peak (Fig. 1.5). The AUC correlates linearly with peptide concentration from low femtomole amounts to tens of picomoles in most contemporary MS instruments [219, 220]. As in spectral counting, peptides are identified via MS2 and sequence database searching, and protein identities are inferred from these peptides. To compare peptide and inferred protein abundance between different samples, each sample is analyzed by nanoLC-MS separately, and the AUC values calculated for the peptides detected in each sample are compared to determine relative abundance. Intensity-based measurements are not used to compare different peptides within the same sample, because each peptide sequence ionizes with a different efficiency, making comparisons based on signal intensity inaccurate.
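At its core, the AUC measurement is numerical integration of an extracted ion chromatogram. The sketch below applies the trapezoidal rule to retention time and intensity samples; real pipelines first detect peak boundaries and subtract baseline, which this illustration skips.

```python
# AUC sketch: integrate a peptide's extracted ion chromatogram
# (retention time vs. intensity) with the trapezoidal rule.

def chromatographic_auc(rt, intensity):
    """rt, intensity: equal-length lists sampled across the elution peak."""
    return sum((rt[i + 1] - rt[i]) * (intensity[i + 1] + intensity[i]) / 2.0
               for i in range(len(rt) - 1))

# Example: a symmetric five-point peak eluting over ~0.4 min.
print(chromatographic_auc([10.0, 10.1, 10.2, 10.3, 10.4],
                          [0, 5e5, 1e6, 5e5, 0]))
```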

Although simple in concept, successful implementation of intensity-based quantification relies heavily on sophisticated software. Open-source software choices have been reviewed elsewhere [221], and some are summarized in Table 1.5. This software automates the critical data processing steps needed to ensure accurate results based on AUC values; a recent review by Christin and colleagues [221] thoroughly describes these steps. One key step is the proper alignment of peaks corresponding to the same peptide across all separate nanoLC-MS data sets. Proper alignment, based on peak m/z values and retention time, ensures that the AUC values measured in each sample correspond to the same detected peptide. Highly reproducible nanoLC systems with high chromatographic resolving power can aid alignment [222], although effective alignment ultimately depends on software. High-accuracy measurements of peptide m/z values on newer MS instruments have greatly helped alignment across separate nanoLC-MS datasets. One attractive feature of peak alignment aided by high mass accuracy data is that a peptide need only be identified by MS2 in one sample [221]; peaks in other samples that align in retention time and accurate m/z can then be confidently assigned to that peptide without requiring their own MS2 identifications.
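A naïve version of this cross-run matching step can be written as a double loop over features: accept a pair when the m/z values agree within a ppm tolerance and the retention times within a window. The sketch below uses illustrative tolerances and ignores the nonlinear retention time drift that real aligners model.

```python
# Cross-run matching sketch: transfer an identification from a reference
# run to another run when a feature there agrees in m/z (ppm tolerance)
# and retention time (minutes). Illustrative tolerances and names.

def match_features(reference, other, ppm_tol=10.0, rt_tol=0.5):
    """reference, other: lists of (mz, rt) features; returns index pairs."""
    pairs = []
    for i, (mz1, rt1) in enumerate(reference):
        for j, (mz2, rt2) in enumerate(other):
            if (abs(mz2 - mz1) / mz1 * 1e6 <= ppm_tol
                    and abs(rt2 - rt1) <= rt_tol):
                pairs.append((i, j))
                break  # take the first acceptable match per reference feature
    return pairs
```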

Another key step is normalization of the measured AUC values. Normalization accounts for bias and variability introduced during sample processing, loading of sample onto the nanoLC column, and in-run variation in MS response. A number of normalization procedures have been developed that are effective in minimizing variability and improving accuracy [223, 224].
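As one concrete example of such a procedure, the sketch below rescales each run so that all runs share a common median AUC, a simple global correction for differences in sample loading and MS response; the published procedures cited above are more sophisticated.

```python
# Median-normalization sketch: scale each run's AUC values so every run
# has the same median, damping loading and response variability.

def median_normalize(runs):
    """runs: list of dicts mapping peptide -> AUC for one nanoLC-MS run."""
    def median(values):
        values = sorted(values)
        n = len(values)
        return (values[n // 2] if n % 2
                else (values[n // 2 - 1] + values[n // 2]) / 2.0)
    medians = [median(list(run.values())) for run in runs]
    target = median(medians)
    return [{pep: auc * target / m for pep, auc in run.items()}
            for run, m in zip(runs, medians)]
```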

As with spectral counting, applications of intensity-based quantification are numerous and are reviewed in detail elsewhere [225, 226]. These applications have used a variety of publicly available software programs for accurate quantification, some of which are summarized in Table 1.5. Here we discuss several representative applications. One interesting, radiation research-relevant example demonstrated the effectiveness of intensity-based quantification in comparing the effects of ionizing radiation on colon cancer cells against a mock-treated control [227]. Disease biomarker discovery has also been a popular application of intensity-based quantification; such studies have been performed on paraffin-embedded archival cancer tissues [228] as well as serum from schizophrenia patients [229].

Overall, label-free quantification addresses many of the limitations of stable-isotope labeling-based technology. Both spectral counting and intensity-based measurements are cheap and simple, with no need to purchase costly labeling reagents or perform extra labeling and processing steps. Spectral counting provides the additional benefit of measuring the relative abundance of proteins within the same sample, whereas stable isotope labeling measures relative abundance only across separate samples. Intensity-based measurement, when paired with effective software for aligning peptide peaks across samples, obviates the need for time- and computation-intensive MS2 acquisition and subsequent peptide identification via sequence database searching. This method is therefore an attractive choice for biomarker studies, where high-throughput comparison across many patient samples is desirable.

Despite numerous strengths, label-free quantification is not without limitations. Unlike some labeling methods, notably iTRAQ and TMT, multiplexed comparative analysis within a single MS experiment is not possible. Instead, each sample being compared must be analyzed in a separate MS experiment, preferably with technical replicates to achieve statistical significance [230]. Consequently, large amounts of instrument time are required, which may not be feasible, especially for researchers relying on sample analysis through a fee-for-service facility. Low-abundance proteins also remain a challenge for both methods. Because spectral counting relies on multiple peptide identifications per inferred protein to achieve statistical significance, low-abundance proteins identified by only a few peptides cannot be accurately quantified. For intensity-based measurements, peptide peaks from low-abundance proteins suffer from low signal-to-noise ratios, challenging their accurate quantification. Improved instrument sensitivity should increase the number of peptides identified from low-abundance proteins and improve the effectiveness of both label-free methods. Recently, a promising new method was described [212] that combines spectral counting and intensity-based measurements, capitalizing on the strengths of both and providing improved results.

1.5 Conclusions

Consistent with history, technological advances will continue to define and mature the field of MS-based proteomics, catalyzing new milestones of achievement. We anticipate these advances to primarily fall in the areas described in this review: new instrumentation and related methods, and new computational methods and software for identification and quantification of proteins from complex datasets. Continued maturation of MS-based proteomics should one day enable realization of its ultimate goal: comprehensive proteome characterization. Researchers seeking to better understand the effects of radiation on living systems will undoubtedly continue to benefit from the continued advances of this vital technology.