1 Retention Time Prediction

A peptide’s retention time (RT) is defined as the time elapsed from the injection of a sample into the chromatography system to the detection of the peptide’s peak maximum. It depends on the chemical structure of the peptide and on its interaction with the chromatographic environment (mobile and stationary phases, temperature, pH, etc.). Therefore, peptide RTs under a particular liquid chromatography (LC) condition can be predicted from structure-related properties of peptides, such as amino acid composition, sequence, hydrophobicity, and other physicochemical properties [1].

The task of RT prediction is to calculate a retention scale for each peptide under the given LC condition, e.g., a hydrophobicity scale in reversed-phase LC. A simple approach is to measure or estimate retention coefficients for individual amino acids and then predict the retention scale of a peptide as the sum of the retention coefficients of its constituent amino acids. The amino acid retention coefficients can be derived either from a set of synthetic peptides in which a residue is substituted by each of the twenty amino acids [9] or from linear regression models fitted to peptides with various amino acid compositions [2, 21, 22, 31].
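The additive model described above can be sketched in a few lines of Python. The coefficient values below are hypothetical placeholders chosen only to make the example run; they are not the published coefficients of any cited model, and real values would have to be measured or fitted for a specific LC condition.

```python
# Hypothetical retention coefficients (arbitrary units), for illustration only.
# Real coefficients must be measured or fitted for a given LC condition.
RETENTION_COEFF = {
    "W": 11.0, "F": 10.5, "L": 9.6, "I": 8.4, "M": 5.8, "V": 5.0, "Y": 4.0,
    "A": 0.8, "T": 0.4, "P": 0.2, "E": 0.0, "D": -0.5, "C": -0.8, "S": -0.8,
    "Q": -0.9, "G": -0.9, "N": -1.2, "R": -1.3, "H": -1.3, "K": -1.9,
}


def retention_scale(sequence: str) -> float:
    """Additive model: sum the coefficients of the constituent residues."""
    return sum(RETENTION_COEFF[aa] for aa in sequence.upper())


if __name__ == "__main__":
    for peptide in ("AAL", "GKR", "LLWF"):
        print(peptide, round(retention_scale(peptide), 2))
```

Because the model uses composition only, two peptides with the same residues in a different order receive the same retention scale, which is the limitation that the sequence-specific models discussed next try to address.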

In recent years, prediction models have been refined by employing peptide sequence information and more sophisticated computational algorithms, as well as larger datasets that reduce the risk of overfitting during training [16, 27]. N-terminal residues were found to influence peptide retention behavior because of the ion-pairing retention mechanism [19]. Taking this effect into account, Krokhin et al. developed a widely used prediction model, the sequence-specific retention calculator (SSRCalc) [16]. This model added a series of sequence-related correction factors to the previous model, which predicts peptide retention scales by summing individual amino acid retention coefficients [9]. In addition to three N-terminal residues, these correction factors included C-terminal residues, the nearest-neighbor effect of charged side chains (Lys, Arg, and His), peptide length, isoelectric point, hydrophobicity, propensity to form helical structures, etc. Another comprehensive model was built by Petritis et al. [27] using an artificial neural network. Similar to SSRCalc, their model used peptide properties such as length, sequence, nearest-neighbor amino acids, hydrophobicity, and hydrophobic moment, as well as predicted secondary structure, as the input nodes of the neural network. Other prediction models were developed along similar lines, but with different choices of peptide properties and statistical models [15, 23, 29].

The refined models improved the prediction accuracy (R²) significantly, from approximately 0.91–0.92 to 0.96–0.98 [17]. However, these conclusions were based on datasets of limited size and were reported by the model authors themselves. A blind comparison of the most recent versions of the prediction models would greatly help in selecting a proper model for practical use. In addition, models based on sequence information and sophisticated computational algorithms often require substantial computation time and large training datasets, so simpler linear models that provide lower but still sufficient prediction accuracy may be preferred in some cases, such as on-the-fly RT prediction and calibration [10].

2 Application of RT Information in Proteomic Analysis

2.1 Peptide Identification Based on LC-MS Data

Accurate mass and time tag (AMT tag) is a well-known strategy for identifying peptide sequences from LC-MS data; it was first developed to characterize the Deinococcus radiodurans proteome [34, 38]. Given that many possible peptide species are unlikely to be detected in a particular biological system, this strategy assumes that the peptides detectable in a biological system can be distinguished by a two-dimensional vector of mass and RT [44]. The strategy comprises two main steps. First, an AMT database for a particular organism or type of biological sample is constructed from high-confidence peptide identifications in previous replicate LC-MS/MS analyses. Second, peptides are identified in LC-MS experiments by matching measured mass and normalized elution time (NET) features to the existing database.
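The second step, matching an LC-MS feature against the AMT database, can be illustrated with a minimal sketch. The `AmtEntry` class, the `match_feature` function, the peptide names, the masses, and the default tolerances are all hypothetical and serve only to show the two-dimensional matching logic; they do not reproduce any published AMT software.

```python
from dataclasses import dataclass


@dataclass
class AmtEntry:
    peptide: str   # peptide sequence (hypothetical labels in this sketch)
    mass: float    # monoisotopic mass in Da
    net: float     # normalized elution time, on a 0..1 scale


def match_feature(mass, net, database, ppm_tol=5.0, net_tol=0.02):
    """Return all database entries within both the mass and the NET tolerance."""
    hits = []
    for entry in database:
        ppm_error = (mass - entry.mass) / entry.mass * 1e6
        if abs(ppm_error) <= ppm_tol and abs(net - entry.net) <= net_tol:
            hits.append(entry)
    return hits


if __name__ == "__main__":
    db = [
        AmtEntry("PEPTIDEA", 1000.0000, 0.30),
        AmtEntry("PEPTIDEB", 1000.0030, 0.60),
    ]
    # Both entries are within 5 ppm of the feature, but only one agrees in NET.
    print(match_feature(1000.0010, 0.31, db))
```

A feature that returns exactly one hit is identified unambiguously; zero or multiple hits indicate that the mass and NET tolerances are too strict or too loose for the complexity of the sample.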

Similar methods also identify peptides from accurate measurements of mass and RT [11, 24, 41]. These methods do not need a reference database to be constructed before peptide identification. Instead, features are matched by measured mass and RT between different LC-MS/MS runs, and peptide identifications from MS/MS spectra are then transferred from one run to the others. In a study of the urinary proteome [25], using the “match between runs” option implemented in the MaxQuant software [3], the authors increased the number of protein identifications in a single run from an average of 462 to 633.

By saving the effort of MS/MS analysis, the AMT tag strategy and similar methods can improve the efficiency and coverage of proteomic analysis. The success of these methods depends on the complexity of the biological system as well as the resolution of both the MS instrument and the LC system. The false discovery rate (FDR) or confidence of peptide identification can be estimated by decoy database searching (shifting the masses of all peptides in the AMT database by a certain value) [28] or by statistical models [20, 37, 43]. A computational simulation study showed that for organisms with relatively small proteomes, such as Deinococcus radiodurans, modest mass and RT accuracies are sufficient for confident peptide identification by the AMT tag strategy. For more complex proteomes, such as the human proteome, stricter criteria should be used: the majority of proteins could be uniquely identified within tolerances of 1 ppm for mass and 0.01 for NET [26].
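The decoy approach mentioned above (shifting all database masses by a fixed value) can be sketched as follows. This is a simplified illustration, not the procedure of [28]: the 11 Da shift, the function names, and the toy data are assumptions, and the FDR is approximated as the ratio of decoy matches to target matches. The default tolerances are the 1 ppm and 0.01 NET values quoted in the text.

```python
def count_matches(features, database, ppm_tol, net_tol, mass_shift=0.0):
    """Count features matching at least one (optionally mass-shifted) entry."""
    count = 0
    for mass, net in features:
        for db_mass, db_net in database:
            shifted = db_mass + mass_shift
            ppm_error = abs(mass - shifted) / shifted * 1e6
            if ppm_error <= ppm_tol and abs(net - db_net) <= net_tol:
                count += 1
                break  # one match per feature is enough
    return count


def estimate_fdr(features, database, ppm_tol=1.0, net_tol=0.01, decoy_shift=11.0):
    """Approximate FDR as decoy matches / target matches.

    decoy_shift is a hypothetical value; any shift that does not correspond
    to a real mass difference between peptides could be used.
    """
    targets = count_matches(features, database, ppm_tol, net_tol)
    decoys = count_matches(features, database, ppm_tol, net_tol, decoy_shift)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    database = [(1000.0, 0.30), (1011.0, 0.30), (1500.0, 0.50)]
    features = [(1000.0, 0.30), (1011.0, 0.30), (1500.0, 0.50), (2000.0, 0.10)]
    print(estimate_fdr(features, database))
```

In the toy data, one feature happens to coincide with a shifted database entry, so the estimate is one decoy match per three target matches. Tightening either tolerance lowers the number of such chance matches.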

2.2 Peptide Identification from MS/MS Spectra

RT information has been used to improve peptide identification from MS/MS spectra in several ways. One strategy is to incorporate RT information into a discriminant function along with other peptide-spectrum matching parameters, such as SEQUEST scores [39]. This discriminant function was trained on data from a known protein mixture. When applied to human plasma proteome analysis, it achieved a 16 % increase in positive peptide identifications.

Predicted RT information can serve as a validation parameter for peptide identifications generated by database search programs. Kawakami et al. [12] validated peptide identifications by the correlation between measured and predicted RTs. Peptide identifications within a certain correlation tolerance were accepted as high-confidence identifications. Several studies reported that the number of true positive peptides increased significantly when an RT filter was combined with a lower database search score threshold [15, 29, 33].
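An RT filter of this kind can be sketched as a two-stage procedure: fit a line between predicted retention scale and measured RT using confident identifications, then accept lower-scoring candidates only if their measured RT falls close to the fitted line. The function names, the tolerance, and the data are hypothetical; the cited studies use their own prediction models and acceptance criteria.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rt_filter(candidates, slope, intercept, tol_min):
    """Keep candidates whose measured RT is within tol_min of the prediction.

    Each candidate is a tuple (peptide, predicted_scale, measured_rt).
    """
    kept = []
    for peptide, scale, measured_rt in candidates:
        expected_rt = slope * scale + intercept
        if abs(measured_rt - expected_rt) <= tol_min:
            kept.append(peptide)
    return kept


if __name__ == "__main__":
    # Confident identifications: predicted scale vs. measured RT (min).
    slope, intercept = fit_line([10.0, 20.0, 30.0], [15.0, 25.0, 35.0])
    candidates = [("PEPA", 12.0, 17.5), ("PEPB", 25.0, 45.0)]
    print(rt_filter(candidates, slope, intercept, tol_min=2.0))
```

The tolerance trades sensitivity against specificity: a narrow window rejects more false matches but also discards true peptides whose retention is poorly predicted.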

In addition to using predicted RT information, Sun et al. built an empirical RT database from high-confidence peptide identifications in repeated LC-MS/MS runs of a urine sample [40]. This database was used to validate MS/MS identifications in new urine samples. The limitation of the empirical database method is that it can only be applied to peptides previously detected in a particular proteome, whereas every peptide sequence can have a predicted RT value. However, the method still has value because it avoids the problem of incorrect RT prediction, which is inevitable given the complex nature of peptide retention behavior.

2.3 Post-translational Modification Identification

A post-translational modification (PTM) on a peptide alters not only its molecular mass but also its physicochemical properties (e.g., hydrophobicity), resulting in an RT shift. The RT difference between the modified and unmodified peptide (ΔRT) provides a new dimension of information, in addition to the mass shift (ΔM), for PTM identification.

Previous studies reported many instances in which peptides with different modification types or different modification sites elute at different RTs [4, 13, 32, 42]. Zybailov et al. [45] depicted the ΔRT distributions of dozens of modification forms detected in a plant proteome. They found that, for the majority of modifications, the direction of the RT shift correlated well with the hydrophobicity shift of the modified peptide. Combining ΔRT and ΔM constraints can efficiently reduce the FDR in PTM identification [32], especially in studies on low-resolution mass spectrometers. For example, deamidation of a peptide results in a mass shift of only 0.984 Da, so the deamidated peptide cannot be accurately distinguished from its unmodified form by a low-resolution LCQ mass analyzer. A study [4] based on synthetic peptide pairs observed that deamidated peptides elute about 3 min later than the corresponding unmodified forms in RPLC. Deamidation detection accuracy was improved from 42 to over 93 % by filtering the original SEQUEST identifications with both ΔRT and ΔM constraints.
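A combined ΔM and ΔRT constraint for deamidation might look like the sketch below. The mass shift (0.984 Da) and the approximate RT shift (about 3 min later) come from the text; the tolerances and the function name are assumptions made for illustration and are not the thresholds used in [4].

```python
DEAMIDATION_DM = 0.984  # Da, mass shift quoted in the text
EXPECTED_DRT = 3.0      # min, approximate RT shift in RPLC quoted in the text


def is_deamidation_pair(unmodified, modified, dm_tol=0.5, drt_tol=2.0):
    """Check whether two (mass, rt) features look like a deamidation pair.

    dm_tol and drt_tol are hypothetical tolerances; dm_tol is deliberately
    wide to mimic a low-resolution instrument, so the RT constraint does
    most of the discrimination.
    """
    dm = modified[0] - unmodified[0]
    drt = modified[1] - unmodified[1]
    mass_ok = abs(dm - DEAMIDATION_DM) <= dm_tol
    rt_ok = abs(drt - EXPECTED_DRT) <= drt_tol
    return mass_ok and rt_ok


if __name__ == "__main__":
    unmodified = (1500.70, 40.0)
    print(is_deamidation_pair(unmodified, (1501.70, 43.2)))  # plausible pair
    print(is_deamidation_pair(unmodified, (1501.70, 40.1)))  # co-eluting
```

With a wide mass tolerance, a co-eluting feature about 1 Da heavier (for instance an isotopic peak of the unmodified peptide) passes the ΔM test but fails the ΔRT test, which is why the second dimension helps on low-resolution data.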

ΔRT information has also been used to improve algorithms for fast searching of unrestricted modifications. The Delta Accurate Mass and Time (DeltAMT) algorithm [7] calculates a two-dimensional delta vector (ΔM, ΔRT) for each pair of spectra obtained in an LC-MS/MS run. The whole set of spectrum pairs comprises two classes: pairs from modified and unmodified forms of the same peptide, and pairs from two unrelated peptides. Thus, there are two classes of delta vectors, modification-induced and random. Bivariate Gaussian mixture models are employed to discriminate modification-induced distributions from random ones. Putative modifications can then be identified and reported with their (ΔM, ΔRT) information as well as the putative modified and unmodified spectrum pairs. Since this algorithm does not use any fragment ion information from MS/MS spectra, it can find high-confidence modifications very quickly. However, the algorithm is limited to highly abundant modifications, since the vector distributions of low-abundance modifications are usually not distinguishable from random ones.
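The first stage of this idea, computing pairwise delta vectors and looking for values that recur more often than chance, can be sketched as follows. This is not the DeltAMT implementation: the bivariate Gaussian mixture modeling is replaced here by a plain two-dimensional histogram, and the function names, bin widths, and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def delta_vectors(spectra):
    """Return (dM, dRT) for every spectrum pair, heavier minus lighter.

    spectra is a list of (precursor_mass, rt) tuples.
    """
    deltas = []
    for a, b in combinations(spectra, 2):
        light, heavy = sorted((a, b))  # tuples sort by mass first
        deltas.append((heavy[0] - light[0], heavy[1] - light[1]))
    return deltas


def densest_bins(deltas, dm_bin=0.01, drt_bin=1.0, top=3):
    """Histogram stand-in for the mixture model: most populated (dM, dRT) bins."""
    counts = Counter(
        (round(dm / dm_bin), round(drt / drt_bin)) for dm, drt in deltas
    )
    return [((i * dm_bin, j * drt_bin), n) for (i, j), n in counts.most_common(top)]


if __name__ == "__main__":
    spectra = [(1000.0, 30.0), (1015.99, 26.0), (1200.0, 50.0), (1215.99, 46.0)]
    for (dm, drt), n in densest_bins(delta_vectors(spectra)):
        print(f"dM={dm:.2f} Da  dRT={drt:.1f} min  pairs={n}")
```

In a real run the number of pairs grows quadratically with the number of spectra, and random pairs populate every bin, so a statistical model of the random background (the role of the mixture model in the original algorithm) is needed to decide which dense bins are significant.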

2.4 Time-scheduled Targeted Proteomic Analysis

Multiple reaction monitoring (MRM) is the method of choice in targeted proteomics. It is a highly sensitive method for accurate quantitation of low-abundance proteins in complex protein mixtures. The method needs a sufficient dwell time for each transition to maintain sensitivity and a reasonable cycle time to ensure accurate quantitation. Thus, only a limited number of transitions can be measured in each cycle, which limits throughput [30]. Time-scheduled transition acquisition (tMRM) offers a solution that can markedly increase the throughput of a traditional MRM experiment without compromising its performance. In this method, the whole gradient is split into small time windows, and transitions are monitored only in selected windows centered on the expected RTs of the peptides. Thus, with the same dwell time and the same number of transitions monitored in each duty cycle, tMRM can measure several times more transitions over the whole gradient [36].
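The throughput gain from scheduling can be made concrete with a small calculation: the dwell time available per transition is the cycle time divided by the number of transitions monitored concurrently, and scheduling lowers that number to the peak overlap of the RT windows. The sketch below ignores inter-scan overhead, and the function names and numbers are illustrative assumptions.

```python
def max_concurrent_transitions(expected_rts, window_min, transitions_per_peptide):
    """Peak number of transitions monitored at once under RT scheduling."""
    events = []
    for rt in expected_rts:
        events.append((rt - window_min / 2, 1))    # window opens
        events.append((rt + window_min / 2, -1))   # window closes
    # At equal times a closing event (-1) sorts before an opening event (+1),
    # so windows that merely touch are not counted as overlapping.
    events.sort()
    open_windows = peak = 0
    for _, step in events:
        open_windows += step
        peak = max(peak, open_windows)
    return peak * transitions_per_peptide


def dwell_time_ms(cycle_time_s, concurrent_transitions):
    """Dwell time per transition, ignoring inter-scan overhead."""
    return cycle_time_s * 1000.0 / concurrent_transitions


if __name__ == "__main__":
    rts = [10, 11, 12, 30, 31]          # expected RTs in minutes
    scheduled = max_concurrent_transitions(rts, window_min=4, transitions_per_peptide=3)
    unscheduled = len(rts) * 3
    print("scheduled:", scheduled, "dwell (ms):", round(dwell_time_ms(2.0, scheduled), 1))
    print("unscheduled:", unscheduled, "dwell (ms):", round(dwell_time_ms(2.0, unscheduled), 1))
```

In this toy case, five peptides with three transitions each require at most nine concurrent transitions when scheduled, rather than fifteen. The saving grows as the targets are spread more evenly across the gradient and as the RT windows become narrower.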

A key to the success of tMRM is defining a proper RT window that captures the entire peptide elution profile from baseline to baseline. This depends on accurate prediction of peptide RTs for each injection. Despite strict control of the LC system, RT shifts between injections are inevitable, especially when experiments last for days to weeks to analyze large numbers of samples. To accommodate the RT shifts, predefined RT windows need to be regularly corrected or repredicted, which reduces the efficiency and robustness of tMRM experiments. To address this, on-the-fly RT calibration methods have been developed and integrated into instrument operating software [8, 14].

This method uses a set of well-characterized landmark peptides to calibrate the RTs of targeted peptides. Landmark peptides can be either spiked-in synthetic peptides [6, 8] or endogenous peptides that are distributed over a broad range of the gradient. At any time point, the RT windows of subsequent targeted peptides are adjusted using a local linear regression model generated from the last two eluted landmark peptides. The RT windows of peptides that elute between the first and second landmark peptides can simply be adjusted by the RT shift of the first landmark peptide, which corrects for the difference in dead volume. Broad RT windows are set for all landmark peptides, as well as for peptides that elute before the third landmark peptide, to ensure that they can be captured with minimal or no calibration.
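The calibration rule described above can be sketched as a small function: with one landmark observed, apply a constant shift; with two or more, extrapolate along the line through the last two landmarks. The function name and the data are hypothetical, and the sketch leaves out the window-width handling performed by the actual instrument software.

```python
def calibrate_rt(target_ref_rt, landmarks):
    """Predict the RT of a target in the current run.

    target_ref_rt: RT of the target in the reference run (min).
    landmarks: list of (reference_rt, observed_rt) for the landmark
               peptides that have already eluted, in elution order.
    """
    if not landmarks:
        return target_ref_rt                      # nothing to calibrate against
    if len(landmarks) == 1:
        ref, obs = landmarks[0]
        return target_ref_rt + (obs - ref)        # constant shift (dead volume)
    (ref1, obs1), (ref2, obs2) = landmarks[-2:]   # last two eluted landmarks
    slope = (obs2 - obs1) / (ref2 - ref1)
    return obs2 + slope * (target_ref_rt - ref2)  # local linear extrapolation


if __name__ == "__main__":
    print(calibrate_rt(15.0, [(10.0, 11.0)]))
    print(calibrate_rt(30.0, [(10.0, 11.0), (20.0, 22.0)]))
```

Using only the last two landmarks keeps the correction local, so a change in gradient slope late in the run is tracked without being diluted by landmarks that eluted much earlier.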

This method achieved success rates of over 90 % in analyses of 180 targeted peptides with gradients ranging from 0.5 to 2 % solvent B per minute, as well as with a nonlinear gradient [8]. It could also precisely correct RT shifts caused by other factors, such as changes in sample loading amount [6] and different LC columns [14]. The method significantly increases the robustness of the entire tMRM workflow by compensating for several commonly occurring changes in experimental conditions, reducing the requirement for LC reproducibility. Researchers are spared offline RT calibration of the LC system and refinement of RT prediction models, which saves experimental time and, importantly, precious biological samples.

3 Discussion and Perspective

It has been well demonstrated that RT information can benefit proteomic data analysis. However, its application in practical proteomic analysis has so far been restricted because RT information has lower resolution than mass information and, importantly, because peptide RT changes with LC conditions. Krokhin and colleagues addressed this issue by optimizing their SSRCalc prediction model for four LC conditions commonly used in proteomics: 300 Å-TFA, 100 Å-TFA, 100 Å-formic acid, and 100 Å-pH 10 [5, 16, 18]. However, since there are hundreds of choices of mobile and stationary phases and other LC parameters in practice, it is impossible to pretrain retention scales for all LC conditions. A more flexible solution is to train and test the prediction model in the same LC run [15]. Theoretically, this solution can adapt to all LC conditions. Its limitation is that it needs a sufficiently large set of high-confidence peptide identifications for model training, which is not always available in a single LC run. Another prediction model, ELUDE, combines the above two solutions [23]. When sufficient data are available, ELUDE derives a new RT index for the condition at hand; otherwise, it selects and calibrates a pretrained model from a library of predictors. Model selection and calibration are performed automatically by robust statistical methods in ELUDE, which facilitates its practical use. However, it should be noted that the accuracy and efficiency of all prediction models still need to be tested blindly on datasets covering a wide variety of LC conditions.

LC alignment is another important technology in this field. Slight changes in LC conditions and the inevitable RT shifts between LC runs can be corrected by this technology [8]. A recent review of LC alignment methods can be found in [35]. A good approach is to employ a set of spiked-in synthetic peptides as landmarks for LC alignment or for correlating predicted retention scales with measured RTs in each run. These peptides are designed to span a wide range of hydrophobicity, allowing accurate alignment over the entire LC profile. For example, six synthetic peptides were employed to optimize the SSRCalc model in different LC conditions (2009), and the eleven iRT standard peptides were used for on-the-fly RT calibration in tMRM analysis [6].

To use RT information as a parameter in data analysis, a proper tolerance value or window size must first be set. This depends heavily on experimental reproducibility: the wider the RT window, the more false positives are admitted. Therefore, there is also an urgent need to establish standards and quality control methods for LC experiments. With the joint effort of bioinformaticians and experimental biologists, RT information should become widely used in practical proteomic analysis in the near future.