1 Introduction

Modern mass spectrometry-based proteomics deals with large datasets and complex analytical workflows and therefore depends critically on software support. Traditionally, proteomics software has been written in statically typed languages such as C/C++ and Java [1–3] (see http://www.ms-utils.org for reference). While programs in these languages run fast, they are relatively slow to develop. In scientific programming and data analysis, this disadvantage is especially noticeable: scientific software is very often developed in an “exploratory” mode, where the specifications are worked out along with the code itself.

Scripting languages offer an alternative. Their flexibility and ease of development make them highly attractive for scientific applications [4, 5]. However, very few proteomic projects have been written in dynamic general-purpose languages [6]. This situation is now changing. In particular, Python has been gaining popularity in the proteomic community in recent years [7–9]. This trend can be explained by its combination of properties: high speed of development; interactivity; an enormous choice of high-quality libraries, including packages for numeric calculations, statistics, and plotting; and relatively painless parallelization and low-level optimization. These qualities of Python are already appreciated by scientists in many other fields [10].

In this communication, we present a set of easy-to-use annotated Python tools for both researchers and software developers in the field of proteomics, and, specifically, LC-MS/MS data mining.

2 Methods

Pyteomics is designed as a toolbox that assists bioinformaticians in developing their own proteomic projects in Python. Scripts for exploratory, reproducible data analysis form one class of such projects: in fact, most modules in Pyteomics initially appeared as byproducts of such research projects. Another type of task targeted by Pyteomics is rapid software prototyping. While the performance of Python may be one or two orders of magnitude lower than that of low-level languages, this difference is compensated by the increased speed of development at the prototyping stage. Once the final architecture of a program is designed, the performance gap can be bridged by rewriting the critical sections of code in Cython [11] or C++.

We did not aim to create a set of solutions for a particular problem in proteomics (e.g., peptide-spectrum matching or data management). Instead, we wanted to design a set of reusable generic components that can be combined to accomplish a wide variety of specialized tasks. The modular, functional design of the library is intended to facilitate both borrowing code from the library and integrating it with the code bases of other projects.

In our opinion, a software project should not be characterized only by the number of functions it contains. API documentation and tutorials, unit testing, packaging and distribution, and code management all belong to the “good programming practices” often omitted in scientific programming [12]. In Pyteomics, we have paid special attention to these issues, trying to make the library as easy to use as possible. In this regard, Python is a natural choice because it provides well-developed solutions to these problems.

3 Module Contents

mass: an interface for calculation of masses of peptides and proteins, modified peptides, peptide ions, and isotopic distributions. The module offers tools for calculation of masses (and m/z) of specific ion types and for specific isotopic states, estimation of isotopic state abundances, and determining the most probable isotopic state of a molecule.
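The kind of calculation this module covers can be illustrated with a minimal, standard-library sketch. It is not the Pyteomics API: the function names `monoisotopic_mass` and `mz` are hypothetical, and the sketch assumes an unmodified linear peptide made of the 20 standard residues that is ionized by protonation only.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the free N- and C-termini
PROTON = 1.007276   # mass of a proton


def monoisotopic_mass(sequence):
    """Neutral monoisotopic mass of an unmodified linear peptide (hypothetical helper)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(sequence, charge):
    """m/z of the protonated peptide [M + zH]z+ (hypothetical helper)."""
    return (monoisotopic_mass(sequence) + charge * PROTON) / charge


peptide_mass = monoisotopic_mass("PEPTIDE")
peptide_mz = mz("PEPTIDE", 2)
print(f"M = {peptide_mass:.4f} Da, m/z (2+) = {peptide_mz:.4f}")
```

Modifications, other ion types, and isotopic states would require extending the mass table and the composition bookkeeping, which this sketch deliberately leaves out.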

achrom: peptide retention time prediction using the additive model of peptide separation by liquid chromatography [13]. The model has additional corrections for the length of a peptide and the terminal amino acids. A simple interface allows calibrating the model for a specific chromatographic setup.
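As an illustration of the additive model, the sketch below predicts a retention time as a sum of per-residue retention coefficients plus a constant term. The function name `predict_rt`, the coefficient values, and the logarithmic form of the length correction are assumptions made for this example; terminal-residue corrections are omitted, and the code is not taken from the library.

```python
import math

# Hypothetical retention coefficients (arbitrary time units), for illustration only.
EXAMPLE_RC = {"A": 1.0, "L": 5.0, "K": -2.0}


def predict_rt(sequence, rc, rt0=0.0, length_coeff=0.0):
    """Additive retention time model (sketch).

    RT = (1 + length_coeff * ln(N)) * sum(rc[residue]) + rt0,
    where N is the peptide length. With length_coeff = 0 this
    reduces to the plain additive model.
    """
    total = sum(rc[aa] for aa in sequence)
    return (1.0 + length_coeff * math.log(len(sequence))) * total + rt0


rt_plain = predict_rt("ALK", EXAMPLE_RC, rt0=3.0)
rt_corrected = predict_rt("ALK", EXAMPLE_RC, rt0=3.0, length_coeff=-0.2)
print(rt_plain, rt_corrected)
```

Calibration for a specific chromatographic setup would amount to fitting the coefficients and the constant term to measured retention times, for example by linear least squares.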

biolccc: a software implementation of BioLCCC, a physical model of polypeptide chromatography [14]. BioLCCC not only predicts retention times of peptides, but may also be used to estimate the effect of various parameters of the experimental setup on the selectivity of peptide separation.

electrochem: prediction of polypeptide charge and isoelectric point using the Henderson-Hasselbalch equation.
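A minimal sketch of such a calculation is given below. Each ionizable group contributes a fractional charge according to the Henderson-Hasselbalch equation, and the isoelectric point is located by bisection on the net charge. The pK values are one commonly tabulated textbook set used here purely for illustration, and the function names `charge` and `isoelectric_point` are hypothetical rather than the library's interface.

```python
# Illustrative pK values (an assumption; published pK sets differ).
POSITIVE_PK = {"N-term": 9.69, "K": 10.5, "R": 12.4, "H": 6.0}
NEGATIVE_PK = {"C-term": 2.34, "D": 3.86, "E": 4.25, "C": 8.33, "Y": 10.0}


def charge(sequence, pH):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    groups = ["N-term", "C-term"] + list(sequence)
    total = 0.0
    for group in groups:
        if group in POSITIVE_PK:
            # Basic group: protonated (charge +1) fraction.
            total += 1.0 / (1.0 + 10 ** (pH - POSITIVE_PK[group]))
        elif group in NEGATIVE_PK:
            # Acidic group: deprotonated (charge -1) fraction.
            total -= 1.0 / (1.0 + 10 ** (NEGATIVE_PK[group] - pH))
    return total


def isoelectric_point(sequence, tol=1e-4):
    """pH of zero net charge, found by bisection on the interval [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        middle = (low + high) / 2.0
        if charge(sequence, middle) > 0:
            low = middle    # still positive: pI lies at higher pH
        else:
            high = middle   # negative or zero: pI lies at lower pH
    return (low + high) / 2.0


pi_value = isoelectric_point("PEPTIDE")
print(f"pI = {pi_value:.2f}, charge at pH 7 = {charge('PEPTIDE', 7.0):.2f}")
```

Bisection works because the net charge decreases monotonically with pH; the sketch assumes the charge changes sign within pH 0 to 14, which holds for typical peptides.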

mzml, fasta, mgf, pepxml, mzid: parsers for the community standards of proteomic data representation. These modules give full access to the information stored in the corresponding formats. The retrieved data are converted to standard Python data types, and the structure of the original files is preserved where possible. Additionally, the fasta and mgf modules allow generation of FASTA databases (including decoy databases) and MGF files. The mzml module extracts information from files in the mzML format [15]. Its design enables memory-efficient parsing of large files, making it easy to process files several GB in size on virtually any machine. The parsing time is approximately 10–15 s/GB on a PC with a 2.4 GHz Intel CPU. The mzid module provides support for the recently introduced mzIdentML format [16].
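The idea of memory-efficient parsing combined with decoy generation can be sketched with generators from the standard library alone. The names `read_fasta` and `with_decoys`, the `DECOY_` prefix, and the example entries are hypothetical, and sequence reversal is only one of several possible ways to build decoys; this is not the code of the fasta module.

```python
import io


def read_fasta(lines):
    """Lazily yield (description, sequence) pairs from an iterable of lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        else:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def with_decoys(entries, prefix="DECOY_"):
    """Yield every entry followed by a decoy with the reversed sequence."""
    for description, sequence in entries:
        yield description, sequence
        yield prefix + description, sequence[::-1]


# Hypothetical two-protein database held in memory for the example.
example = io.StringIO(">sp|P1|TEST one\nMKWV\nTFIS\n>sp|P2|TEST two\nLLRK\n")
records = list(with_decoys(read_fasta(example)))
for description, sequence in records:
    print(description, sequence)
```

Because both functions are generators, only one entry is held in memory at a time when the input is an open file rather than an in-memory string.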

parser: a technical module that simplifies handling of peptide sequences. It allows in silico cleavage of polypeptide sequences, generation of modified sequences, etc.
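In silico cleavage can be expressed compactly with a regular expression that marks the cleavage sites. The sketch below is an assumption-laden illustration rather than the module's implementation: the function name `cleave`, its parameters, and the simplified trypsin rule (cleave after K or R unless followed by P) are chosen for the example.

```python
import re

# Simplified trypsin rule: a zero-width site after K or R, not before P.
TRYPSIN = r"(?<=[KR])(?!P)"


def cleave(sequence, rule=TRYPSIN, missed_cleavages=0):
    """Return the set of peptides produced by cleaving at the rule's sites."""
    # Cleavage positions plus both termini, without duplicates.
    sites = sorted({0, len(sequence)}
                   | {m.start() for m in re.finditer(rule, sequence)})
    peptides = set()
    for i in range(len(sites) - 1):
        # Join up to `missed_cleavages` additional adjacent fragments.
        last = min(i + 2 + missed_cleavages, len(sites))
        for j in range(i + 1, last):
            peptides.add(sequence[sites[i]:sites[j]])
    return peptides


print(sorted(cleave("AAKGGRPCCKDD")))
print(sorted(cleave("AAKGGRPCCKDD", missed_cleavages=1)))
```

Returning a set removes duplicate peptides, which is convenient for database construction but discards positional information; a list of (start, peptide) pairs would be an alternative design.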

4 Results and Discussion

Pyteomics implements a wide range of methods that cover all stages of typical bottom-up and top-down proteomic workflows as shown schematically in Figure 1: in silico protein digestion, retention time and isoelectric point prediction, mass (and m/z) calculation, combined with full access to data stored in protein sequence databases, LC-MS/MS data files, and search engine output. This enables combining experimental data and calculations in all possible ways.

Figure 1. Functionality of Pyteomics modules (inner circles) covers all stages of a typical proteomics experiment (outer circles) for data processing in bottom-up, top-down, and middle-down approaches.

Pyteomics has already been used in a number of research projects [17–20]. In a recent study on peptide sequence scrambling [17], pyteomics.mass was used to perform thorough annotation of all fragment peaks in high-quality databases of CAD and HCD fragmentation spectra. The pyteomics.achrom and pyteomics.parser modules were applied to evaluate the degree of orthogonality between HILIC and reversed-phase peptide separation techniques [19]. Apart from the published studies, Pyteomics is commonly employed in our laboratory and by our collaborators for routine tasks such as proteomic search engine optimization or post-search analysis of peptide-spectrum matches, as well as for illustrative and educational purposes.

5 Conclusions

Pyteomics provides a set of tools for rapid proteomics software development. It implements low-level mass spectrometry abstractions in the high-level Python programming language. The advantages of the library are its simplicity of design, the diversity of implemented modules, and careful implementation in accordance with good programming practices. The library may find its niche in applications where code flexibility is the key requirement (e.g., in exploratory data analysis and/or software prototyping). The source code of the library, documentation, and installation packages are publicly available under the Apache license, ver. 2.0 (http://www.opensource.org/licenses/Apache-2.0). The license allows copying, distributing, and modifying the work, or using any parts of it (including commercial use), on the condition of attribution to the original authors. Pyteomics has supported Python 3 since ver. 1.2.0.

New features are added through the source code repository at http://hg.theorchromo.ru/pyteomics, which allows accepting patches from the community and complies with the continuous integration paradigm. The directions of further development include optimization of the critical segments of code with Cython, as well as support for other file formats and new features resulting from community feedback. We also plan to interact actively with other software projects to provide a consistent and feature-rich Python environment for bioinformaticians.