1 Introduction

With recent advances in mass spectrometry and separation methods, liquid chromatography coupled with mass spectrometry (LC-MS) has become an essential analytical tool in biomedical research. LC-MS provides qualitative and quantitative analyses of a variety of biomolecules in a high-throughput fashion, and there has been significant progress in systems biology research and biomarker discovery using LC-MS based proteomics [1–3].

LC-MS methods can be used for extraction of quantitative information and detection of differential abundance [4–6]. This requires that a rigorous analysis workflow be implemented. In addition to analytical considerations, crucial steps include: (1) an experimental design that avoids introducing bias during data acquisition and enables effective utilization of available resources [7], (2) a data preprocessing pipeline that extracts meaningful features [8], and (3) a statistical test that identifies significant changes based on the experimental design [9]. Conducting these three steps in a coherent manner is key to a successful LC-MS based proteomic analysis. Good experimental design helps identify true differences in the presence of variability from various sources. This benefit can diminish if the data analysts fail to analyze the LC-MS data appropriately and to conduct the subsequent statistical tests in accordance with the experimental design. This chapter introduces data preprocessing pipelines for LC-MS based proteomics, with a focus on untargeted and label-free proteomic analysis. We provide a high-level overview of LC-MS data preprocessing and highlight associated challenges. Furthermore, we present a step-by-step example for analysis of LC-MS data from an untargeted proteomic study, and describe how the results can be utilized in subsequent evaluation using targeted quantitative approaches such as multiple reaction monitoring (MRM).

2 LC-MS Data Preprocessing

In a typical untargeted proteomic analysis, proteins are first enzymatically digested into smaller peptides, and thousands of these peptides can be profiled in a single LC-MS run. The profiling procedure involves chromatographic separation and MS based analysis. Owing to differences in hydrophobicity and polarity, among other properties, each peptide elutes from the LC column at a distinct retention time (RT). The eluted peptide is then analyzed by MS or tandem MS (MS/MS). An LC-MS run contains RT information in the chromatogram, mass-over-charge ratio (m/z) in the MS spectrum, and relative ion abundance for each particular ion. MS signals detected throughout the range of chromatographic separation are formatted in a three-dimensional map, which defines the data from a single LC-MS run, as shown in Fig. 1. The LC-MS data contain quantitative information on detected peptides and their associated proteins, which are identified by de novo sequencing or database searching using MS/MS spectra [10]. A reliable preprocessing pipeline is needed to extract features (usually referred to as peaks) from LC-MS data, in which each peptide is characterized by its isotopic pattern, resulting from common isotopes such as 12C and 13C, in a set of MS spectra within its elution duration, superimposed on noise signals (Fig. 2). Adequate consideration of these characteristics is crucial for LC-MS data preprocessing, which includes noise filtering, deisotoping, peak detection, RT alignment, peak matching, and normalization. Typically, these data preprocessing steps generate a list of detected peaks characterized by their RTs, m/z values, and intensities. The preprocessed data can be used in subsequent analysis, e.g., identification of significant differences between groups. Association of these peaks with peptides/proteins is achieved through MS/MS identification, which is beyond the scope of this chapter; we refer interested readers to the literature [10].
In this section, critical preprocessing steps are introduced and discussed.

Fig. 1

An LC-MS run contains RT information in chromatogram, mass-over-charge ratio (m/z) in MS spectrum, and relative ion abundance for each particular ion

Fig. 2

A typical feature in LC-MS data

2.1 Noise Filtering

LC-MS data are subject to electronic and chemical noise arising from contaminants present in the column solvent or from instrumental interference. Appropriate noise filtering can increase the signal-to-noise ratio (SNR) and facilitate the subsequent peak detection step. Some software tools, e.g., MZmine 2 [11], integrate noise filtering into the peak detection step to ensure coherence. Smoothing filters such as the Gaussian filter and the Savitzky-Golay filter [12] are commonly applied to reduce the effects of noise. Because resolution and detection limit differ among LC-MS platforms, parameters for the smoothing filters need to be selected adaptively, preferably through a pilot experiment with similar experimental settings.
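As an illustration of the smoothing idea described above, the following is a minimal sketch of Gaussian smoothing applied to a single intensity trace. It is not the implementation used by MZmine 2 or any other tool mentioned here; the function names, the example trace, and the choice of `sigma` are assumptions made for demonstration only.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Return a normalized 1-D Gaussian kernel (weights sum to 1)."""
    if radius is None:
        radius = max(1, int(round(3 * sigma)))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(intensities, sigma=1.0):
    """Smooth an intensity trace; edges are handled by repeating the end values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(intensities)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the edges
            acc += w * intensities[j]
        out.append(acc)
    return out

# Hypothetical noisy trace with one chromatographic peak around index 5
trace = [0, 1, 0, 2, 9, 20, 11, 1, 0, 3, 0]
smoothed = smooth(trace, sigma=1.0)
```

A larger `sigma` suppresses more noise but also broadens and lowers genuine peaks, which is why the filter width would normally be tuned on pilot data from the same platform.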

2.2 Deisotoping

Most chemical elements have naturally occurring isotopes, e.g., 12C and 13C are two stable isotopes of the element carbon with mass numbers 12 and 13, respectively. Consequently, each analyte gives rise to more than one ion peak in an MS spectrum, where the peak arising solely from the most common isotope of each element is called the monoisotopic peak. In LC-MS based proteomics, each peptide is characterized by an envelope of ion peaks due to its constituent amino acids. 13C constitutes about 1.11% of the carbon species, and the approximately one dalton (Da) mass difference between 13C and 12C results in a 1/z difference between adjacent ion peaks in the isotopic envelope, where z is the charge state of the peptide. The deisotoping step integrates sibling ion peaks originating from the same peptide and summarizes them by the monoisotopic mass. This facilitates the interpretation of LC-MS data and reduces the complexity of subsequent analysis. DeconTools [13] is widely used to deisotope MS spectra, which involves: (1) identification of the isotopic pattern, (2) prediction of the charge state based on the distance between the ion peaks, and (3) comparison between the observed isotopic pattern and a theoretical distribution generated based on an average residue.
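The relationship between peak spacing and charge state can be sketched in a few lines. The example below is a simplified illustration, not the DeconTools algorithm: it covers only charge prediction from the 1/z spacing and assumes that the lowest m/z peak of the envelope is the monoisotopic peak. It omits the comparison against a theoretical isotopic distribution. The function names, tolerance, and example envelope are hypothetical.

```python
PROTON_MASS = 1.007276      # Da
ISOTOPE_SPACING = 1.00335   # Da, approximate 13C - 12C mass difference

def infer_charge(mz_peaks, max_charge=6, tol=0.01):
    """Guess charge z from the mean spacing of adjacent isotopic peaks (~1/z)."""
    peaks = sorted(mz_peaks)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    if not gaps:
        return None
    mean_gap = sum(gaps) / len(gaps)
    for z in range(1, max_charge + 1):
        if abs(mean_gap - ISOTOPE_SPACING / z) <= tol:
            return z
    return None

def monoisotopic_mass(mz_peaks):
    """Neutral monoisotopic mass, assuming the lowest m/z peak is monoisotopic."""
    z = infer_charge(mz_peaks)
    if z is None:
        return None
    return (min(mz_peaks) - PROTON_MASS) * z

# Hypothetical isotopic envelope with ~0.5 m/z spacing, i.e., a doubly charged ion
envelope = [500.2500, 500.7517, 501.2534]
charge = infer_charge(envelope)
mass = monoisotopic_mass(envelope)
```

For the example envelope, the mean spacing of about 0.5017 m/z corresponds to z = 2, giving a neutral mass of roughly 998.485 Da.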

2.3 Peak Detection

Peak detection is a procedure to determine the existence of a peak in a specific range of RT and m/z values, and to quantify its intensity. Many LC-MS peak detection approaches [11, 14, 15] are adapted from previously established methods such as those for analysis of matrix-assisted laser desorption/ionization time of flight (MALDI-TOF) MS data [16, 17]. In consideration of the isotopic pattern naturally present in LC-MS data, alternative strategies have also been exploited (e.g., as in the MaxQuant platform [18]). Most existing methods perform peak detection via a pattern matching process, followed by a filtering step based on quantified peak characteristics. A critical issue is that the elution profiles may vary across different RTs [19]. As a result, the use of a single pattern throughout the whole RT range, as in current approaches, may lead to inaccurate estimates of peak characteristics and SNR, where the latter is often employed as a filtering criterion. Also, peak detection is usually performed for each LC-MS run individually, without leveraging the information from other runs in the same experiment. Utilization of multi-scale information from multiple runs has been proposed for analysis of MALDI-TOF data [20]. This idea could potentially be applied to LC-MS data and lead to a more reliable peak detection result, where the peak matching step introduced later plays an important role.
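To make the "detect, then filter by SNR" idea concrete, the following sketch finds local maxima in a single extracted ion chromatogram and keeps those whose SNR exceeds a threshold. It is a deliberately simple stand-in for the pattern matching methods cited above, not a reproduction of any of them. The noise estimate (median absolute deviation), the threshold, and the example data are assumptions.

```python
from statistics import median

def detect_peaks(rt, intensity, min_snr=3.0):
    """Return (rt, intensity, snr) for local maxima whose SNR is at least min_snr."""
    baseline = median(intensity)
    # Median absolute deviation as a crude noise estimate (assumption)
    noise = median(abs(x - baseline) for x in intensity) or 1e-12
    peaks = []
    for i in range(1, len(intensity) - 1):
        if intensity[i - 1] < intensity[i] >= intensity[i + 1]:
            snr = (intensity[i] - baseline) / noise
            if snr >= min_snr:
                peaks.append((rt[i], intensity[i], snr))
    return peaks

# Hypothetical chromatogram: one clear peak plus small fluctuations
rt = [round(10.0 + 0.1 * i, 1) for i in range(12)]
intensity = [5, 6, 5, 7, 30, 80, 35, 8, 6, 5, 7, 6]
peaks = detect_peaks(rt, intensity)
```

Because the baseline and noise are estimated once for the whole trace, this sketch shares the limitation discussed above: it does not adapt to elution profiles that change along the RT axis.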

2.4 Normalization

Because LC-MS data are subject to various sources of analytical and technical variability, intensity measurements must be appropriately normalized to remove systematic biases and reduce the effect of obscuring variability. A typical normalization approach identifies a reference for ion intensities and adjusts the measurements based on that reference. Identification of a reliable reference is therefore crucial for the normalization process. Most existing methods assume that each of the LC-MS runs in the same experiment should have an equal concentration of molecules on average [21]. With this assumption, measures including the sum, median, and quantiles of the ion intensities are used as the reference for normalization. Unfortunately, the validity of this assumption is questionable, as an increase of concentration in a specific group of molecules is not necessarily compensated by a decrease in other groups [22]. More rigorous approaches using regression methods based on a set of matched peaks [23] or spiked-in internal standards [22] have been proposed. However, it is unclear whether neighboring ions (in terms of RT, m/z value, or intensity) necessarily share a similar drifting trend along the analysis order. At present, the use of quality control (QC) runs to assess and correct variability in LC-MS data appears to be the most reliable approach [24], in which QC runs can be collected using a reference sample or a mixture pooled from the analyzed samples. This idea has been successfully implemented for large-scale metabolomic studies, where variability along the analysis order is estimated for each of the detected peaks through assessment of the QC runs [24]. This circumvents the need to select an arbitrary reference, at the cost of additional experimental challenges in assuring appropriate coverage and reproducible detection of ions in the QC runs.
Alternatively, a recently published method called MaxLFQ [25] leverages information from every pair of peptides between samples to account for the reproducibility issue and exploits such information to accomplish normalization at protein level.
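As a concrete instance of reference-based normalization, the sketch below rescales each run so that its median intensity matches a common target. It relies on the equal-average-concentration assumption questioned above and is shown only to illustrate the mechanics. It is not the QC-based approach or MaxLFQ. The run names and intensities are hypothetical.

```python
from statistics import median

def median_normalize(runs):
    """Scale each run so its median intensity equals the mean of all run medians."""
    medians = {name: median(values) for name, values in runs.items()}
    target = sum(medians.values()) / len(medians)
    return {
        name: [v * target / medians[name] for v in values]
        for name, values in runs.items()
    }

# Hypothetical intensities: run2 was measured at half the overall signal of run1
runs = {"run1": [100, 200, 400], "run2": [50, 100, 200]}
normalized = median_normalize(runs)
```

After scaling, both runs share the same median, so a global difference in signal level no longer appears as a difference in abundance. If the true total abundance differs between groups, this adjustment would remove real biology as well, which is the weakness noted in the text.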

2.5 RT Alignment and Peak Matching

The peak matching step groups consensus peaks across multiple LC-MS runs prior to subsequent analysis, e.g., identification of significant differences between samples, to ensure a valid comparison of the LC-MS runs. It is also crucial for potential extensions of the peak detection and normalization steps that leverage information from multiple runs. The main challenge in peak matching results from the presence of RT variability among LC-MS runs. Recent advances in MS technology have made highly precise and accurate mass measurement (low- to sub-ppm) achievable [26]. However, controlling the chromatographic variability remains challenging. Most LC-MS preprocessing pipelines (e.g., OpenMS [14], msInspect [27], MZmine 2 [11]) integrate the estimation of RT variability into the peak matching step, in order to perform RT alignment and achieve reliable identification of consensus peaks.

RT alignment approaches can be categorized as: (1) feature-based approaches and (2) profile-based approaches [28]. The feature-based approaches perform the alignment task based on detected peaks and rely on the correct identification of a set of consensus peaks among LC-MS runs. On the other hand, the profile-based approaches utilize chromatograms of the LC-MS runs to estimate the variability along RT and then make an adjustment accordingly [29–31].

Incorporation of information from peptide identification can reduce the matching ambiguity and improve the alignment result [32, 33]. For example, the PEPPeR platform [33] integrates peak lists and MS/MS identification for RT alignment. A more sophisticated approach has been implemented in MaxQuant [18], which leverages each preprocessing step to enhance the overall performance. In profile-based alignment, utilization of complementary information from various sources has also been shown to yield better alignment performance [30].
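A feature-based matching step can be sketched as follows: peaks from two runs are paired when they agree within a mass tolerance (in ppm) and an RT window, and the matched pairs are then used to estimate a global RT shift. This is a minimal illustration with a greedy one-to-one assignment and a constant shift. The pipelines cited above use more elaborate models, and the tolerances and peak lists here are assumptions.

```python
from statistics import median

def ppm_diff(mz_a, mz_b):
    """Absolute mass difference in parts per million, relative to mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6

def match_peaks(run_a, run_b, ppm_tol=10.0, rt_tol=0.5):
    """Greedy one-to-one matching of (mz, rt) peaks; returns index pairs (i, j)."""
    matches, used = [], set()
    for i, (mz_a, rt_a) in enumerate(run_a):
        best_d, best_j = None, None
        for j, (mz_b, rt_b) in enumerate(run_b):
            if j in used:
                continue
            d = abs(rt_a - rt_b)
            if ppm_diff(mz_a, mz_b) <= ppm_tol and d <= rt_tol:
                if best_d is None or d < best_d:
                    best_d, best_j = d, j
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j))
    return matches

def estimate_rt_shift(run_a, run_b, matches):
    """Median RT offset (run_b minus run_a) over matched peaks."""
    return median(run_b[j][1] - run_a[i][1] for i, j in matches)

# Hypothetical peak lists as (m/z, RT in minutes)
run_a = [(500.2500, 20.1), (650.3300, 35.4), (720.4100, 48.0)]
run_b = [(500.2502, 20.4), (650.3296, 35.8), (800.5000, 48.1)]
matches = match_peaks(run_a, run_b)
shift = estimate_rt_shift(run_a, run_b, matches)
```

The tight ppm tolerance reflects the high mass accuracy noted above, while the wider RT window absorbs chromatographic variability. The third peak in each run stays unmatched because the masses differ.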

3 Pipeline for LC-MS Data Preprocessing

Several preprocessing pipelines have been made available in various software tools including OpenMS [14], msInspect [27], MZmine 2 [11], and MaxQuant [18]; however, very few studies have systematically evaluated and compared their performance [34]. As a result, determination of the most appropriate pipeline is still challenging. As a starting point, we present a step-by-step example using MaxQuant in this section. This software tool is chosen for demonstration due to (1) its ease of use, (2) its capability to handle data from large-scale LC-MS experiments, and (3) its active discussion forum.

MaxQuant can be downloaded from http://www.maxquant.org after registration. The minimum requirement for installation is a personal computer with a CPU frequency of 800 MHz and 2 GB of RAM per thread. A multicore processor is recommended for parallel computation. Prerequisite software/plug-ins include Xcalibur, MSFileReader, and .NET Framework 4.5. A peptide search engine, Andromeda [35], is integrated as part of MaxQuant, and downstream bioinformatics and statistical analyses of the MaxQuant outputs can be performed using Perseus, if needed. Users are referred to the forum (https://groups.google.com/forum/#!forum/maxquant-list) for related discussions and possible solutions. For comparative analysis by label-free LC-MS methods, detailed preprocessing steps using MaxQuant (version 1.4.1.2) are described below.

3.1 Importing Files

  1.

    Launch the MaxQuant graphical interface (Fig. 3) and load the .raw files (from Thermo instruments) to be processed. The basic information (file name, size, etc.) of the imported data will be displayed on the interface. Specify additional information (e.g., fraction labels) for the MaxQuant analysis using the experimental design template.

    Fig. 3

    Procedure of MaxQuant (Subheading 3.1): loading files (panel a) and setting up experimental design template (panels b–c)

  2.

    Click the “Write template” icon to generate a “combined” folder in the same location as the .raw files.

  3.

    In the newly generated “combined” folder, open the template file “experimentalDesignTemplate.txt” in an appropriate editor (e.g., Microsoft Excel).

  4.

    The template file presents a table with three columns, where the “Name” column should already be filled in with the .raw file names. Complete the table with distinct numbers in the “Fraction” column and group information in the “Experiment” column (see Note 1). Save these changes.

  5.

    Click the “Read from file” icon and select the modified template file to import the specified information.
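For experiments with many runs, the template table described in the steps above could also be filled in programmatically rather than by hand. The sketch below builds the three columns named in step 4 ("Name", "Fraction", "Experiment") and serializes them as tab-delimited text. The tab delimiter, the run names, and the group labels are assumptions for illustration, so the file written by "Write template" should be checked for the exact layout before anything like this is used.

```python
import csv
import io

COLUMNS = ["Name", "Fraction", "Experiment"]

def build_design(run_names, experiment_of):
    """Assign a distinct fraction number and an experiment label to each run."""
    rows = []
    for fraction, name in enumerate(run_names, start=1):
        rows.append({"Name": name, "Fraction": fraction,
                     "Experiment": experiment_of[name]})
    return rows

def to_tsv(rows):
    """Serialize the design table as tab-delimited text (delimiter is an assumption)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical run names and group labels
run_names = ["control_01", "control_02", "case_01"]
experiment_of = {"control_01": "control_1", "control_02": "control_2",
                 "case_01": "case_1"}
design_text = to_tsv(build_design(run_names, experiment_of))
```

Each run receives its own experiment label here so that runs are compared rather than combined (see Note 1).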

3.2 Setting Group-Specific Parameters

Click the “Group-specific parameters” tab (Fig. 4a), where default values are given for general experiment information, label-free quantification, first search, and advanced settings. If data acquired with different experimental protocols are processed together, users can set specific parameters for each group. Modify settings within each parameter group according to the specific experiment.

Fig. 4

Procedure of MaxQuant (Subheadings 3.2–3.4): setting group-specific parameters (panel a); setting global parameters and starting analysis (panel b)

  1.

    The “Type” setting is machine dependent. Select “All Ion Fragmentation” if an Exactive is used. “Standard” (default) should be selected for other Thermo instruments (XL, Velos, etc.).

  2.

    Specify labels, if a labelling strategy is used. For a label-free analysis, select “Multiplicity” as “1”.

  3.

    “Variable modifications” settings describe the chemical reactions on the proteins. This does not include fixed modifications, which should be selected under “Global parameters”.

  4.

    Select the enzyme used to digest the proteins in “Digestion mode”. Trypsin is used in most cases.

  5.

    Indicate the maximum number of allowable missed cleavages during enzymatic digestion. The default value is “2”.

  6.

    Specify the instrument type.

  7.

    Select “LFQ” for label-free analysis.

  8.

    The “First Search” and “Main Search” (under “Advanced”) specify a two-step search in MaxQuant, where a number of peptides are selected for calibration of mass and RT, followed by a refined search.

3.3 Setting Global Parameters

Click the “Global parameters” tab (Fig. 4b), where default values are given for settings including general analysis information, sequences, identification, protein quantification, site quantification, label-free quantification, isobaric label quantification, etc. These settings apply to all data files. The following steps describe the critical settings that should be modified according to the specific experimental design.

  1.

    Click “Add file” to load the .fasta files for the database against which the processed spectra are searched. The files are parsed through Andromeda configuration (see Note 2).

  2.

    Specify the fixed modifications such as carbamidomethylation of cysteine.

  3.

    “Re-quantify” allows the first search to serve as a calibration step prior to the more exact main search and re-calibration steps. “Match between runs” enables association of spectral identification across LC-MS/MS runs based on RT and accurate mass. Selecting both boxes is recommended.

  4.

    In the “Sequences” section, set “Decoy mode” and “Special AAs”, select “Include contaminants”, and load other .fasta files if the database used for the first search is different from the one loaded in “Fasta files”.

  5.

    Set the search parameters in “Identification”, such as false discovery rate (FDR), number of peptides required for a valid identification, minimum peptide length, minimum number of unique (see Note 3) and razor peptides, posterior error probability (PEP), and score cutoff. Deselect the “Filter labelled amino acids” box for label-free analysis.

  6.

    Specify the quantification methods in “Protein quantification,” including minimum ratio count, peptide type for quantification, and whether modified peptides are considered.
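The FDR and decoy settings above rest on the target-decoy idea: spectra are searched against both real (target) and decoy sequences, and the proportion of decoy hits above a score threshold serves as an estimate of the false discovery rate. The sketch below illustrates that general principle only. It is not MaxQuant's or Andromeda's actual scoring or FDR procedure, and the scores and threshold are hypothetical.

```python
def score_cutoff_at_fdr(matches, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR stays within max_fdr.

    matches: list of (score, is_decoy) tuples.
    FDR is estimated as (#decoys) / (#targets) among matches at or above the cutoff.
    Returns None if no cutoff satisfies the requirement.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff

# Hypothetical search results: (score, is_decoy)
matches = [(120, False), (110, False), (100, False), (90, False),
           (80, True), (70, False), (60, True)]
cutoff = score_cutoff_at_fdr(matches, max_fdr=0.25)
```

With these example scores, accepting everything down to a score of 70 gives one decoy among five targets (estimated FDR of 0.2), whereas including the next hit would raise the estimate to 0.4.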

3.4 Starting Analysis

  1.

    Set the “Number of threads” available to the analysis at the bottom of the setting window for global parameters (Fig. 4b). Using more threads yields faster computation times.

  2.

    Start the analysis with the above settings. The progress can be monitored in the “Performance” tab.

4 Analysis of Targeted Quantitative Proteomic Data

Untargeted LC-MS based proteomics is generally biased towards analysis of the most abundant and observable proteins. Biologically relevant molecular responses, however, are often less discernible in that analysis. Targeted quantification by multiple reaction monitoring (MRM) using triple quadrupole (QqQ) mass spectrometers has been introduced to overcome the limitations of untargeted analysis [36]. Briefly, the MRM method organizes the analysis of a specific list of peptides associated with targeted proteins, characterized by the m/z values of their precursor and fragment ions. The precursor-fragment ion pairs are called transitions, which are highly specific and unique for the targeted peptides. A specific ion is selected in the first quadrupole (Q1) on the basis of its precursor m/z value. The ion is fragmented by collision-induced dissociation (CID) in the second quadrupole. Only the relevant ions produced by the fragmentation are selected in the third quadrupole (Q3). The resulting transitions are then used for quantification. As the data acquisition is highly specific with less interference from irrelevant ions, the MRM analysis can yield more sensitive and accurate quantification results.
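To illustrate what a transition looks like numerically, the sketch below computes the precursor m/z (the Q1 value) and the singly charged y-ion m/z values (candidate Q3 values) for an unmodified peptide from standard monoisotopic residue masses. It is a simplified illustration rather than the procedure of any particular tool: modifications, b-ions, multiply charged fragments, and transition ranking are ignored, and the function names and example peptide are assumptions.

```python
WATER = 18.010565   # Da, monoisotopic mass of H2O
PROTON = 1.007276   # Da

# Monoisotopic residue masses (Da) of the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def precursor_mz(peptide, charge=2):
    """m/z of the protonated, unmodified peptide at the given charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge

def y_ion_mz(peptide, n):
    """m/z of the singly charged y-ion containing the last n residues."""
    return sum(RESIDUE_MASS[aa] for aa in peptide[-n:]) + WATER + PROTON

def transitions(peptide, charge=2):
    """List (Q1 m/z, Q3 m/z) pairs for y1 up to y(n-1)."""
    q1 = precursor_mz(peptide, charge)
    return [(q1, y_ion_mz(peptide, n)) for n in range(1, len(peptide))]

# Hypothetical target peptide
pairs = transitions("PEPTIDE")
```

Each tuple corresponds to one precursor-fragment pair as described above. In practice only a few of the most intense and interference-free pairs would be retained per peptide.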

Most bioinformatics tools developed for targeted proteomic data analysis have been either limited in their functions or restricted to specific instrument vendors [37]. Freely available software tools, such as MaRiMba [38], MRMaid [39], and TIQAM [40], are designed only to aid the creation of transition lists. Proprietary software packages, such as Agilent Mass Hunter Workstation, Applied Biosystems MRMPilot, Thermo-Fisher Pinpoint, and Waters TargetLynx, are limited to specific instrument vendors and are not freely accessible. MRMer [41] and Skyline [37] are two instrument-independent and freely available platforms used for MRM analysis. In the following, we briefly present the major steps for targeted quantification using Skyline, including design of the transition list and analysis of acquired MRM data. This software can be downloaded from https://proteome.gs.washington.edu/software/skyline.

To design a transition list using Skyline, users should import spectral libraries (e.g., public spectral libraries or results from search engines such as Andromeda applied in untargeted analysis) and background proteome files (e.g., a human proteome database) to provide background information for the targeted proteomic experiments. Skyline can then read and match the inserted list of targeted proteins (as fasta sequences or protein IDs, typically from untargeted proteomic data analysis). Skyline allows users to customize the parameters of the generated transitions (e.g., precursor charges, ion types, and product ions). The selected transitions and corresponding spectra are visualized in Skyline windows, which facilitates further refinement, such as removing poor matches in the spectral library, before the list is exported. To analyze MRM data acquired with transition lists that have already been designed (not necessarily by Skyline), we set up the background proteome information and insert the transition list with associated proteins into Skyline. The data collected on a QqQ MS instrument using this transition list are then imported. Skyline loads the files into its high-performance data caches, from which the relevant information can be retrieved efficiently. Meanwhile, peak detection is performed automatically and detected peaks are assigned to their corresponding transitions. Once this is completed, Skyline highlights the transitions with their integration boundaries and measured signals. Users can inspect the data by comparing replicates (across samples) in terms of their RT and intensity ratios. Manual curation is allowed to correct erroneous assignments and adjust the integration boundaries (see Note 4). Finally, the quantification results can be customized and exported into a .csv file, on which the downstream statistical analysis can be performed.
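As one example of a downstream check on the exported .csv file, the sketch below computes the coefficient of variation (CV) of peak areas across replicates for each peptide. The column names ("Peptide", "Replicate", "Area") and the data are hypothetical placeholders. The actual columns depend on how the export is customized in Skyline.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev

def cv_by_peptide(csv_text):
    """Percent CV of peak areas per peptide, for peptides with at least two replicates."""
    areas = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        areas[row["Peptide"]].append(float(row["Area"]))
    return {
        pep: 100.0 * stdev(vals) / mean(vals)
        for pep, vals in areas.items()
        if len(vals) > 1
    }

# Hypothetical export with three replicate measurements of one peptide
example = (
    "Peptide,Replicate,Area\n"
    "PEPTIDEK,rep1,100\n"
    "PEPTIDEK,rep2,110\n"
    "PEPTIDEK,rep3,90\n"
)
cvs = cv_by_peptide(example)
```

Peptides with a high CV across technical replicates would be natural candidates for the manual inspection of integration boundaries described above.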

5 Notes

  1.

    If the user specifies an identical name for several LC-MS runs in the “Experiment” column, their information will be combined and these individual runs will not be compared. This is, however, an ideal setting if they are all fractions of the same sample.

  2.

    Andromeda configuration is required before starting MaxQuant to correctly retrieve protein sequence information from the .fasta files, as different databases may be delimited in distinct ways. Figure 5 illustrates the main configuration steps, including (1) loading a new database entry by clicking the green plus button (“+”) in the “Sequence” tab; (2) importing a user-defined .fasta file; (3) specifying a parsing rule from the list in the “Select Rule” tab; (4) checking in the “Test Rule” tab whether Andromeda is able to retrieve the information from the .fasta file correctly; (5) clicking the green plus button (“+”) in the top-left corner of the “Select Rule” panel if users need to write specific rules.

    Fig. 5

    Configuration of Andromeda

  3.

    The uniqueness of a peptide depends on the proteome database. In MaxQuant, a peptide is recognized as unique to a group of proteins (termed a protein group) if, across the entire proteome, its sequence occurs only in this group.

  4.

    The inspection is crucial for cases where multiple peaks are detected, and consequently selection of the best peak may not be consistent across samples. To improve the performance of peak selection, Skyline also allows users to create custom advanced selection models and to utilize information from iRT retention time prediction of peptides.