Summary
The capacity of proteomics methods and mass spectrometry instrumentation to generate data has grown substantially in recent years. This growth in data volume has in turn increased the reliance on software to identify peptide or protein sequences from the recorded mass spectra. Diverse algorithms can be applied to process these data, each performing a specific task such as spectrum quality filtering, spectral clustering and merging, assigning a sequence to a spectrum, or assessing the validity of these assignments.
The key algorithms in mass spectral processing pipelines are those that assign a sequence to a spectrum. The most commonly used variants depend crucially on the information contained in the sequence database that they use as the basis for identification. Because these sequence databases are constructed in different ways, and can therefore vary substantially in the amount and type of data they contain, they are also discussed here.
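To illustrate why database-dependent identification is bounded by the contents of the sequence database, the following is a minimal sketch of the candidate-selection step only. It assumes tryptic cleavage (after K or R, not before P), unmodified peptides, no missed cleavages, and a precursor tolerance in ppm; the function names, the toy database, and the default tolerance are all hypothetical and are not taken from any particular search engine. Real tools go on to score fragment ions against each candidate, which is not shown here.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added for the peptide termini


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(precursor_mass, database, tol_ppm=10.0):
    """Return (protein_id, peptide) pairs whose mass lies within tolerance.

    Only peptides derivable from `database` can ever be returned, which is
    why the composition of the database limits what can be identified.
    """
    hits = []
    for protein_id, sequence in database.items():
        for peptide in tryptic_peptides(sequence):
            mass = peptide_mass(peptide)
            if abs(mass - precursor_mass) / mass * 1e6 <= tol_ppm:
                hits.append((protein_id, peptide))
    return hits


# Hypothetical two-entry database, used purely for illustration.
toy_database = {"P1": "MKWVTFISLLK", "P2": "AAGKPEPTIDER"}
```

Calling `candidates(peptide_mass("WVTFISLLK"), toy_database)` returns the single matching peptide from `P1`; a spectrum from a peptide that is absent from the database yields no candidate at all, however good the spectrum.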
Acknowledgments
Lennart Martens thanks Prof. Dr. Joël Vandekerckhove and Prof. Dr. Kris Gevaert for sharing their extensive knowledge on proteomics, and Henning Hermjakob for support.
Copyright information
© 2009 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Martens, L., Apweiler, R. (2009). Algorithms and Databases. In: Reinders, J., Sickmann, A. (eds) Proteomics. Methods in Molecular Biology™, vol 564. Humana Press. https://doi.org/10.1007/978-1-60761-157-8_14
DOI: https://doi.org/10.1007/978-1-60761-157-8_14
Publisher Name: Humana Press
Print ISBN: 978-1-60761-156-1
Online ISBN: 978-1-60761-157-8
eBook Packages: Springer Protocols