Background

Liquid chromatography (LC) coupled with tandem mass spectrometry (MS/MS) is the method of choice for the identification of proteins extracted from biological samples. The standard procedure of post-MS/MS data processing involves computer-aided interpretation of the measured spectra with MASCOT [1], SEQUEST [2] or some other software that compares theoretical spectra calculated for database sequences with the experimental ones. However, modern instruments generate extremely large sets of MS/MS spectra (on the order of 10000 per sample), which are heavily contaminated with different types of background and noise. In addition to b-, y- and their derivative ions from peptides, spectra contain repeated shifted signals due to the natural isotope distribution (isotope clusters), multiply charged replicas, peaks from unknown fragmentation pathways, sample-specific or systematic chemical contaminations and random noise from the electronic detection system.

Thus, the spectra consist mostly of background; typically, only a few percent of the spectra recorded have signals from target protein fragments and just about 10% of the peaks in such a spectrum contribute to the peptide identification. Consequently, computer resources in mass spectrometry departments all over the world are mostly spent on analyzing non-relevant data, if a significant identification of the protein is possible within the background at all. This strategy clashes with limitations in compute server capacity in proteomics laboratories and seriously limits the access of less generously equipped teams to the field.

With the broad availability of accurate MS/MS instruments with resolution on the order of tenths of a Dalton, automatic background removal before the application of interpretation software became possible [3-5]. Various spectrum pre-processing rules, deconvolution of multiply charged peaks and deisotoping procedures have been described [6-15]. It should be noted that many spectra do not contain peaks from peptide fragmentations or are extremely noisy and, therefore, cannot be reliably interpreted as peptide sequences. Thus, the exclusion of non-interpretable spectra is a valid strategy for reducing the computational load. A well performing method should remove clearly more than half, or even three quarters, of the experimental MS/MS spectra and, essentially, keep all interpretable ones. At the same time, the computation time for this task should be negligible or, at least, small compared to the processing time of an interpretation program such as MASCOT that is saved by unselecting a large subset of spectra.

Published approaches to this problem differ in the criterion for spectrum selection, either with empirically defined score functions or with a classifier generated by automated learning approaches [16-23]. Although many of these methods apply quite sophisticated criteria, they either are not efficient filters or suffer from a substantial fraction of unselected but nevertheless interpretable MS/MS spectra (e.g., loss of ~10% of the interpretable spectra for removing ~75% of the total number of spectra in Figures 2 and 3 of Bern et al. [18]). Thus, a substantial reduction of the computational load is traded for the risk of missing the desired peptide hit. Consequently, none of the published techniques has entered routine laboratory use so far.

In an attempt to develop an alternative methodical approach, we propose to return to ideas from the beginning of mass spectrometry of proteins. Originally, interpretation of an MS/MS spectrum meant experts trying to manually find sequence ladders (i.e., sets of peaks with amino acid mass spacing between them) among the high-intensity peaks. The concept of searching mainly among the higher intensity peaks is still reflected in the formulas for evaluating the significance of a peptide hit as used in MASCOT [1]. Indeed, a peptide whose theoretical fragmentation spectrum matches exclusively low-intensity peaks cannot serve as a convincing explanation of the experimental data.

In this work, we explore the idea that at least some short oligopeptide segment of a significant peptide hit should be fully matched by the higher intensity peaks in the spectrum. In an efficient implementation, the computational costs are low if one just checks whether small peptide ladders of predefined length occur at all in an MS/MS spectrum among the top fraction of most intense peaks. The identity of the oligopeptide is not important in this context; the question is rather whether any such amino acid chain can be matched at all. It is reasonable to suggest that the spectrum is probably not interpretable as a peptide sequence with statistical significance if not even a short oligopeptide sequence is matched by this criterion.

After this unselecting procedure, the remaining spectra typically still contain considerable background. In a previous publication [24], we developed an approach based on techniques from electrical signal processing. Periodic band-reject and high-frequency filters as well as correlation analyses with etalons of multiply charged clusters can successfully be used for background suppression. In this work, we describe a workflow that combines the sequence ladder criterion with improved signal processing criteria, implemented in MS Cleaner version 2.0 and tested on a large MS/MS dataset. It efficiently reduces the number and the size of spectra and, subsequently, dramatically shrinks the computing time used by the interpretation software. We emphasize that the approach described in this work is intended to increase the efficiency of protein identification. It is not designed for processing MS/MS data that is to be screened for protein posttranslational modifications.

Methods

Mass spectrometry

Commercially acquired proteins (α-amylase, amyloglucosidase, apo-transferrin, β-galactosidase, carbonic anhydrase, catalase, phosphorylase B, glutamic dehydrogenase, glutathione transferase, immunoglobulin γ, lactic dehydrogenase, lactoperoxidase, myoglobin) were used, each in two independent preparations (each with a concentration of 100 fmol). For chromatography, an UltiMate Plus Nano-LC system (LC Packings - A Dionex Co.) was used. Chromatographic mobile phases were: loading mobile phase 0.1% TFA in water, separation mobile phase A 5% acetonitrile in 0.1% aqueous formic acid and mobile phase B 80% acetonitrile, 20% water with 0.08% formic acid. The sample was loaded for 10 min onto a reversed phase trap column (PepMap C18, 300 μm ID × 5 mm length, 5 μm particle size, 100 Å pore size, LC Packings - A Dionex Co., not online with the separation column) at a flow rate of 20 μl/min and washed free of ion pairing agents and other impurities.

The gradient for separation of analytes starts at 10 min when the trap column is switched online with the separation column (PepMap C18, 75 μm ID × 15 cm length, 3 μm particle size, 100 Å pore size) at 0.275 μl/min. The gradient starts at 100% mobile phase A and changes to 50% mobile phase B from 10 minutes (trap column and separation column online) to 40 minutes. An additional wash step with 90% mobile phase B is incorporated in order to clean the separation column and elute hydrophobic analytes. After the separation, the trap column is switched offline and equilibrated with loading mobile phase. The analytical nano column is equilibrated with separation mobile phase A. The mass spectrometric data are only recorded while both columns are online.

The mass spectra were recorded with a Thermo Finnigan LTQ (positive nano-ESI mode, ionizing spray voltage: 1.5 kV, enhanced mass-spec full-scan range: 220 - 2000 amu). The much smaller datasets for bovine serum albumin (BSA), yeast alcohol dehydrogenase (ADH) and human transferrin (TRF) recorded with a 3D IT mass spectrometer (model DecaXP Thermo Finnigan) were reused from our previous work [24].

File processing and MS/MS data analysis

The MS/MS output was converted into mgf-files (MASCOT generic format). Each dataset was then separately processed using the MS Cleaner program (with default internal parameters), generating two new mgf-files with cleaned and bad (non-interpretable) spectra, respectively. The MASCOT search parameters were the same in all runs (enzyme: trypsin; fixed modifications: carbamidomethyl (at cysteines) for BSA, ADH and TRF, carboxymethyl (at cysteines) for other proteins; variable modifications: oxidation (at methionines); peptide charges: 1+, 2+ and 3+; mass values: monoisotopic; protein mass: unrestricted; peptide mass tolerance: ± 2 Da; fragment mass tolerance: ± 0.8 Da; max. missed cleavages: 1). The MASCOT search results output html-file was formatted with standard scoring, a significance threshold of p < 0.05, and an ion score cut-off for each peptide of 30. The non-redundant protein database (NCBI) was used (both for the local PC MASCOT installation and for the MASCOT Linux cluster).

In this work, we compare the MASCOT interpretation results of non-pre-processed tandem MS datasets with those obtained after a two-step preprocessing. First, each spectrum (.dta-file) is analyzed with the sequence ladder algorithm. Only those spectra that pass this test are then processed with the background removal routines described in our previous publication [24].
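The two-step preprocessing can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MS Cleaner code: the function names `read_mgf_spectra` and `two_step_preprocess` are hypothetical, the reader handles only the basic `BEGIN IONS` / `END IONS` layout of MASCOT generic format files, and the sequence ladder test and the background removal routines are passed in as placeholder callables.

```python
def read_mgf_spectra(text):
    """Yield (header_lines, peaks) for every BEGIN IONS ... END IONS block.

    header_lines keeps the KEY=VALUE lines verbatim; peaks is a list of
    (m/z, intensity) tuples taken from the first two columns of peak lines.
    """
    headers, peaks, inside = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            headers, peaks, inside = [], [], True
        elif line == "END IONS":
            if inside:
                yield headers, peaks
            inside = False
        elif inside and line:
            if "=" in line:
                # Spectrum-level parameters such as TITLE=, PEPMASS=, CHARGE=
                headers.append(line)
            else:
                fields = line.split()
                peaks.append((float(fields[0]), float(fields[1])))


def two_step_preprocess(text, passes_ladder_test, remove_background):
    """Split spectra into cleaned and bad (non-interpretable) sets.

    passes_ladder_test(peaks) -> bool and remove_background(peaks) -> peaks
    are placeholders for the two processing steps (assumed interfaces).
    """
    cleaned, bad = [], []
    for headers, peaks in read_mgf_spectra(text):
        if passes_ladder_test(peaks):
            # Step 2 is applied only to spectra that passed step 1.
            cleaned.append((headers, remove_background(peaks)))
        else:
            bad.append((headers, peaks))
    return cleaned, bad
```

The two returned lists correspond to the two output files (cleaned and bad spectra); only the cleaned set would be handed over to the interpretation software.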

The sequence ladder algorithm

For this algorithm, two parameters are critical: n (in amino acid residues), the minimal length of the sequence ladder, and s (in per cent), the fraction of peaks from the spectrum that is considered of high intensity. The number n can theoretically be just one (i.e., we would require just two high-intensity peaks that are spaced by the mass difference corresponding to the mass of one of the amino acids); yet, larger values of n (for example, between two and six residues) represent stricter requirements on the sequence ladder. The other parameter s restricts the search space. For this purpose, the peaks in the spectrum considered (i.e., in one .dta-file) are sorted by intensity into a list in descending order. Only the first part of this list (the fraction s of the total set) is used for searching sequence ladders. The condition s = 100% implies that all peaks are included; yet, considerably smaller values of s are desirable since they help unselect more non-interpretable spectra. Once the set of high-intensity peaks is defined, their pair-wise mass differences are compared in a systematic enumeration with the masses of amino acid residues (to select pairs of peaks separated by the mass of any of the amino acids within a user-defined accuracy) and it is tested whether a subset of peaks forms a sequence ladder of the required minimal length. If at least one such ladder is found, the search is stopped and the procedure is restarted with the next tandem MS spectrum in the dataset.
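The ladder test described above can be sketched as follows. This is a minimal illustration and not the MS Cleaner implementation: the function name `has_sequence_ladder`, the default tolerance of 0.5 Da and the rounding of the top-peak count are assumptions, only singly charged fragments are considered, and the residue table holds standard monoisotopic masses of unmodified residues (the fixed cysteine modifications used in the MASCOT searches are ignored here).

```python
# Monoisotopic residue masses in Da (unmodified; Leu and Ile are isobaric).
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def has_sequence_ladder(peaks, n=4, s=0.20, tol=0.5):
    """Return True if the top fraction s of peaks contains a ladder of n residues.

    peaks: list of (m/z, intensity) tuples of one spectrum.
    n:     minimal ladder length in residues (n + 1 peaks are needed).
    s:     fraction of most intense peaks to search (0 < s <= 1).
    tol:   mass accuracy in Da (assumed default, user-defined in practice).
    """
    if not peaks:
        return False
    # Keep only the fraction s of peaks with the highest intensity.
    k = max(1, int(round(s * len(peaks))))
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:k]
    mzs = sorted(mz for mz, _ in top)
    masses = list(RESIDUE_MASSES.values())
    # longest[j]: length in residues of the longest ladder ending at peak j.
    longest = [0] * len(mzs)
    for j in range(len(mzs)):
        for i in range(j):
            diff = mzs[j] - mzs[i]
            if any(abs(diff - m) <= tol for m in masses):
                longest[j] = max(longest[j], longest[i] + 1)
        if longest[j] >= n:
            return True  # stop at the first ladder of sufficient length
    return False
```

Because the selected peaks are sorted by m/z, every admissible pair points from a lighter to a heavier peak, so the longest ladder can be found with a single quadratic pass over the selected peaks rather than by enumerating all peak subsets.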

Modifications of the noise detection algorithm

If a spectrum has passed the sequence ladder test, it is handed over to a series of routines for noise and background detection. The procedures for removing multiply charged peak clusters with the etalon method and for the suppression of high-frequency noise with a low-pass filter after Fourier transformation have been described in a previous publication [24] in detail and have been applied without changes here.

The algorithm for the removal of latent periodic background (including deisotoping) received another option with respect to the determination of the base frequency of the noise. We observed that the determination of the base frequency f_B in the first power spectrum (see sections 3.3 and 3.5 in ref. [24]) is, in rare cases, not as unambiguous as in Figure 2A of ref. [24] since several almost equally intense peaks may appear in the second-level Fourier transform. Wrong detection of the base frequency f_B leads to a wrong multi-band rejection filter, and a few interpretable spectra can be lost after applying this technique. This ambiguity can be avoided by not simply choosing the frequency of the most intense peak in the second-level Fourier transform. Rather, we propose to iterate through all possible base frequencies detected in this spectrum. For each of these frequencies, the theoretical maxima and minima expected in the first-level Fourier transform are calculated. The best match between the theoretical and experimental maxima and minima (see Figure 3 in ref. [24]) confirms the right base frequency. We call this method "soft recognition" of latent periodic noise. It should be applied if minor improvements in sequence coverage (in rare cases, a single additional peptide) are more important than data size reduction; yet, it increases the computation time by about 10% compared with the previous method [24].

Standalone implementation and cluster version

We created two implementations of MS Cleaner 2.0. A single-machine Windows version was used for most of the computations in this article and is available for free download at the associated WWW site. A Unix port of the MS Cleaner 2.0 software is deployed in a clustered environment in order to guarantee scalability. The spectrum file is partitioned into work packages, which are then handed over to a batch queuing system for scheduling on available nodes. Each node processes the spectra in its work package and transfers the results back to the controlling application, where they are post-processed into the final good/bad spectra output. This version is the engine behind the MS Cleaner 2.0 WWW server.
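The partition, process and merge pattern of the cluster version can be illustrated with a short Python sketch. It is an assumption-laden stand-in rather than the deployed software: a local thread pool replaces the batch queuing system, the names `make_workpackages`, `process_package` and `run_partitioned`, the package size and the worker count are hypothetical, and the per-spectrum test is passed in as a placeholder callable.

```python
from concurrent.futures import ThreadPoolExecutor


def make_workpackages(spectra, package_size):
    """Partition the list of spectra into consecutive work packages."""
    return [spectra[i:i + package_size]
            for i in range(0, len(spectra), package_size)]


def process_package(package, is_interpretable):
    """Work done on one node: split a package into good and bad spectra."""
    good = [s for s in package if is_interpretable(s)]
    bad = [s for s in package if not is_interpretable(s)]
    return good, bad


def run_partitioned(spectra, is_interpretable, package_size=500, workers=4):
    """Controlling application: distribute packages, then merge the results.

    The thread pool only stands in for a batch queuing system; map()
    returns results in package order, so spectrum order is preserved.
    """
    packages = make_workpackages(spectra, package_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            lambda pkg: process_package(pkg, is_interpretable), packages))
    good = [s for g, _ in results for s in g]
    bad = [s for _, b in results for s in b]
    return good, bad
```

Since each spectrum is tested independently of all others, the packages need no communication with each other, which is what makes this kind of partitioning scale with the number of nodes.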

WWW Supplement

At the WWW-site http://mendel.bii.a-star.edu.sg/mass-spectrometry/MSCleaner-2.0/, supplementary resources are available: all experimental mass-spectrometry data used in this work, the processed spectra, the user manual, default parameter datasets and a free downloadable Windows version of the program MSCleaner 2.0 as well as free access to a MSCleaner 2.0 WWW server accessing a local Linux cluster. Other implementations can be obtained on request.

Results and discussion

For the initial determination of optimal parameter ranges (sequence ladder length n and peak intensity threshold s), we used the datasets for bovine serum albumin (BSA), yeast alcohol dehydrogenase (ADH) and human transferrin (TRF) from our previous work [24] since they are quite small (less than 3000 .dta-files per set). We checked the influence of the preprocessing procedures on the spectrum interpretation with the MASCOT tool. A systematic analysis was performed; sequence ladder length was tested with values n between 2 and 6 and the high-intensity threshold s was varied from 5% to 35% (the sequence ladder was searched for only among the 5%, 10%, 15%, ..., or 35% of most intense peaks). The goal is to have as many unselected "bad" spectra as possible (the savings in computing time are about proportional to the fraction of spectra that is not handed over to the spectrum interpretation program) without losses of (i) MASCOT score, (ii) spectra giving peptide matches and (iii) sequence coverage.
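The systematic parameter scan can be written down compactly. The following sketch is only an illustration: the function names and the dictionary keys (`unselected`, `score`, `matches`, `coverage`) are hypothetical, and the strict "no loss in any of the three criteria" rule is one possible reading of the goal stated above, not necessarily the selection procedure used for this study.

```python
from itertools import product


def parameter_grid(n_values=range(2, 7), s_values=range(5, 40, 5)):
    """All (n, s) pairs: n = 2..6 residues, s = 5%, 10%, ..., 35%."""
    return list(product(n_values, s_values))


def best_setting(results, baseline):
    """Pick the setting that unselects most spectra without any loss.

    results:  dict mapping (n, s) to a dict with the keys 'unselected',
              'score', 'matches' and 'coverage' (assumed layout).
    baseline: dict with 'score', 'matches' and 'coverage' obtained for
              the non-preprocessed data.
    Returns the best (n, s) pair, or None if every setting loses something.
    """
    acceptable = {
        key: r for key, r in results.items()
        if r["score"] >= baseline["score"]
        and r["matches"] >= baseline["matches"]
        and r["coverage"] >= baseline["coverage"]
    }
    if not acceptable:
        return None
    return max(acceptable, key=lambda key: acceptable[key]["unselected"])
```

With the ranges given above, the grid contains 5 × 7 = 35 parameter combinations per dataset.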

Due to space limitations, only the results for a subset of parameters are presented (Table 1). As expected, the number of detected bad spectra increases with growing sequence ladder length n and decreasing intensity threshold s. We observe that the MASCOT score of the non-preprocessed data (586 for BSA, 224 for ADH and 588 for TRF; see rows with n = 0 and s = 0%) is considerably smaller than that of the cleaned datasets (often, by a factor of 2-5) regardless of the severity of data pre-processing. Thus, the reliability of the top protein hit in the database searches greatly increases with background reduction, both by discarding bad spectra and by removing noise from spectra that can be interpreted as peptides. This alone is an interesting result.

Table 1 Influence of background removal on the recovery of BSA, ADH and TRF in MS/MS spectra of 100 fmol test samples. The original numbers of MS/MS spectra for the BSA (bovine serum albumin), ADH (yeast alcohol dehydrogenase) and TRF (human transferrin) datasets (recorded on a DecaXP machine) are 2679, 2325 and 2608, respectively. The intensity threshold s (column 3) describes the search of the sequence ladder (length n in column 2) within the 15%, 20%, 25% or 30% top peaks (100% - all peaks are considered). The following three columns show the MS Cleaner output - number of spectra with background removal, number of unselected spectra and the MS Cleaner CPU time on a single-processor Windows XP computer (Pentium IV 2.4 GHz; to get exact measurements of computation time, we did not use the cluster version). The remaining four columns present the MASCOT output - the CPU time on the same machine, the protein score, the number of spectra matching peptides in a MASCOT search and the final sequence coverage. For each dataset, the first line shows the results for the case when MS Cleaner is not used for pre-processing and the MS/MS data is immediately interpreted by MASCOT.

The sequence coverage is more sensitive to the pre-processing parameters. For a sequence ladder length of n = 5 residues, we see a trend that sequence coverage is slightly decreased with respect to that of unprocessed data (41-54% instead of 55% for BSA, 21-31% instead of 39% for ADH, 45-48% instead of 47% for TRF). Sequence coverage is about the same as or even slightly higher than for non-preprocessed data for sequence ladder lengths n = 3 and n = 4 and intensity thresholds s at and above 20%. With regard to the number of spectra that lead to a significant peptide match in the MASCOT search, the settings n = 3, s = 20%; n = 3, s = 25%; n = 4, s = 20% and n = 4, s = 25% come close to reproducing the result achieved with the unprocessed data for the BSA and TRF cases. Surprisingly, the number of peptide matches is slightly higher for s = 100% (all peaks are included in the sequence ladder search) than for the datasets without preprocessing. Thus, the number of spectra falsely rejected by the sequence ladder algorithm is essentially zero in these two cases. For ADH, the number of spectra matching peptides is always somewhat lower if the tandem MS data is pre-processed, although MASCOT score and sequence coverage do not suffer from choices of n = 3 or n = 4 and the higher values of s.

To detect a considerable fraction of the bad spectra and to reduce the time for interpretation by MASCOT, these results support the selection of a sequence ladder length of n = 4 and an intensity threshold of s = 20%. If the sequence coverage is more important than computational time savings, softer parameters can be chosen, for example an intensity threshold of s = 25%. With these parameters, it is possible to eliminate more than 80% of all spectra in the datasets BSA, ADH and TRF by declaring them non-interpretable as oligopeptides (see Table 1). Minor sequence coverage loss, if observed at all, does not affect the interpretation result. At the same time, the total computing time required for interpretation shrinks to as little as 20% of the original value. The computing time consumed by MS Cleaner alone in such a setting is ~2% of the MASCOT time for non-preprocessed data (see Table 1); i.e., it is essentially negligible.

For further analysis of the algorithm's performance, large MS/MS datasets recorded from samples with known protein composition are necessary. For this purpose, we used solutions of commercially available proteins at 100 fmol concentration. The behavior of the MS Cleaner algorithms was tested on this large dataset of about 270000 spectra from 26 samples of 13 proteins (Table 2) generated by an LTQ device. We used sequence ladder length n = 4 with intensity thresholds s = 20% and s = 25% and contrasted the results both (i) with the MASCOT-based interpretation of non-preprocessed data and (ii) with sequence ladder length n = 4 and the inclusion of all peaks (s = 100% threshold). We find that, as a rule, preprocessing reproduces or slightly improves the sequence coverage relative to the non-preprocessed data (100-110% for threshold s = 100% (columns A4 and A7), 100-108% for thresholds s = 20% and s = 25% (columns A4, A11 and A15)). Thus, the number of spectra falsely rejected by the sequence ladder algorithm is essentially zero in these examples. This clear trend suggests that the preprocessing algorithm proposed here performs even better if it is supplied with more accurate data from the LTQ instrument as compared with those from the DecaXP. There is a trend for increased MASCOT scores (103-140% for threshold s = 100% (columns A3 and A6), 98-140% for s = 20% (A3 and A10) and 103-140% for s = 25% (A3 and A14)), with an average of 110% regardless of threshold. The reduction of the dataset by unselecting spectra is significant (on average, 11% for threshold s = 100% (column A5), 63% for threshold s = 20% (column A8) and 53% for threshold s = 25% (column A13)). This means that the interpretation time with MASCOT is reduced in a similar proportion.

Table 2 Performance of the MSCleaner version 2.0 over a large test set.

To summarize, the results support that testing spectra for interpretability in oligopeptides is a useful criterion for dataset reduction in protein mass spectrometry if a sequence ladder of a tetrapeptide segment is searched for among the 20% (or 25%) most intense peaks. This preprocessing is accompanied by an increase in MASCOT score and more significant top protein hits and it does not significantly affect sequence coverage. Running MS Cleaner 2.0 as a standard preprocessing step in peptide tandem MS data analysis for protein identification is recommended.

The idea of using short series of sequence ions (peptide sequence tags) as a specific identifier that speeds up searches for matches between spectra and sequences in databases (either by searching the database with the tag or by creating sequence tag database filters in order to reduce the size of a database via a preprocessing step) is extensively explored in the literature [25-27]. It is interesting to see that this simple idea, applied to the problem of recognizing spectra non-interpretable as oligopeptides, greatly reduces the complexity of analyzing protein mass spectrometry data.