
1 Introduction

The application of proteomics in veterinary science lags behind its use in human research, and the gap is particularly acute in clinical medicine (Ceciliani et al. 2014). This slow start may turn out to be an advantage, as the recent boost in veterinary proteomics coincides with technological developments (e.g. targeted proteomics and data-independent acquisition) and improvements in method accuracy and coverage. This progress in turn challenges the design of the automation procedures necessary to cope with ever-increasing amounts of data, thereby justifying a dedicated chapter on bioinformatics in this book.

Several generic reviews summarise the advent of shotgun proteomics (Aebersold 2003; Nesvizhskii 2010), which provides the original context for software development. The corresponding methodology, mainly focused on protein identification with sequence database search engines, is now mature, with recognised shortcomings such as a strong bias towards abundant proteins. The present chapter is therefore centred on more recent efforts applicable to veterinary science.

One of the remarkable paradigm shifts away from shotgun proteomics with data-dependent acquisition (DDA) is data-independent acquisition (DIA). It introduces the systematic fragmentation of precursor ions, as opposed to a limited selection based on high-intensity peaks. Within successive and overlapping windows along the m/z dimension, all precursor ions are fragmented, resulting in a highly populated and redundant collection of tandem MS data (Chapman et al. 2014; Bilbao et al. 2015a). The sheer number of mass spectra requires smart and accurate software for data analysis (Bilbao et al. 2015a). DIA is now well established for its ability to monitor detectable peptides with high sensitivity and reproducibility across multiple samples. Because all detectable ions within a wide m/z range are fragmented systematically, regardless of intensity, extracted ion chromatograms (XICs) can be generated at the fragment ion level. DIA is therefore particularly suitable for consistent and accurate quantification. A direct consequence of using DIA methods is the expanding use of spectral libraries, in particular as an alternative to sequence database searching. Matching experimental to reference spectra is considerably faster and less error-prone than checking all possible theoretical spectra of the tryptic digest of proteins. Corresponding software tools for spectral library searching are reviewed in Griss (2016). However, the lack of a standardised file format and the incomplete coverage of spectral libraries still limit expansion. At present, library searching is recognised as a helpful complement to database searching.
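To make the library-matching principle concrete, the following minimal R sketch scores two centroided spectra with a normalised dot product, the family of similarity measures underlying most library search tools; the bin width and the toy spectra are illustrative choices, not taken from any specific tool.

```r
# Minimal spectral match score: normalised dot product (cosine similarity)
# between two centroided spectra binned onto a fixed m/z grid.
spectral_dot <- function(spec_a, spec_b, bin_width = 0.5) {
  breaks <- seq(0, 2000, by = bin_width)          # fixed m/z grid (arbitrary range)
  bin <- function(s) {
    v <- numeric(length(breaks) - 1)
    idx <- findInterval(s$mz, breaks)
    for (i in seq_along(idx)) v[idx[i]] <- v[idx[i]] + s$intensity[i]
    v / sqrt(sum(v^2))                            # unit-length normalisation
  }
  sum(bin(spec_a) * bin(spec_b))                  # 1 = identical, 0 = disjoint
}

# Toy example: a reference library spectrum vs. a noisy experimental spectrum
ref <- list(mz = c(175.1, 262.1, 375.2, 488.3), intensity = c(30, 100, 80, 45))
obs <- list(mz = c(175.1, 262.1, 375.2, 490.0), intensity = c(25, 90, 85, 10))
spectral_dot(ref, obs)   # values close to 1 indicate a good library match
```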

The second major development in recent proteomics is quantification and in particular targeted proteomics (Picotti and Aebersold 2012). Selecting the appropriate peptides, optimising fragment prediction and integrating these steps in a pipeline are the main bioinformatics challenges as summarised in Reker and Malmström (2012). These are definitely usable across the board, irrespective of the application.

Finally, a crucial point remains that of data and processing quality. Data should be processed with robust software that ensures reproducible and accurate results; even with robust software and optimised settings, low-quality data yield poor and questionable results. Reproducible, high-confidence results strongly rely on software usability and on the ability to choose the most appropriate parameters, since different parameters can lead to different results and to wrong interpretations or conclusions. Of all the steps in the proteomics workflow, from sample collection to data processing, “Mount Bioinformatics” remains the last and the highest peak to climb (Aebersold 2009). The challenges of software for quantitative proteomics have been described previously (Cappadona et al. 2012), including the issues that directly impact the effectiveness of proteomic quantitation and the common computational tasks that correct for them. This chapter surveys the resources and software tools currently in use for data reference and analysis in proteomics and tackles veterinary proteomics issues in this context, referring either to published work or to prospective solutions.

2 Data Resources

As presented by Perez-Riverol et al. (2015), the information generated in proteomics experiments is organised in three levels:

  1. Raw MS data

  2. Processed experimental data

  3. Interpreted biological results

Level (1) corresponds to MS data collection through the ProteomeXchange protocol (Vizcaíno et al. 2014; Deutsch et al. 2017). In the past few years, this international initiative has allowed channelling all reported/published experimental data into three main repositories through standardised submission and dissemination pipelines.

Level (2) encompasses a significant number of databases storing peptide/protein identification and quantification whenever available.

Level (3) is associated with the concept of “knowledgebase” in which protein information is curated and recorded and can be searched, compared or mined.

2.1 Generic Proteomics Databases in a Nutshell

Several recent reviews comprehensively cover the topic of proteomics databases (Martens 2010; Perez-Riverol et al. 2015). Suffice to say that the collection of identified proteins is steadily growing and this broadens the extent of comparative or integrative approaches of data analysis. This is made possible through the generalised use of shared data formats and standards acknowledged by the Proteomics Standard Initiative (PSI; http://psidev.info). Table 1 summarises the range of formats in use in the current proteomics databases.

Table 1 Formats in use in the current proteomics databases

Martens and Vizcaíno (2017) very recently praised the “golden age” of proteomic data sharing precisely based on the availability and broad usage of standards.

2.1.1 Mass Spectrometry Data

File formats commonly used in MS-based proteomics are reviewed by Deutsch (2012). Mass spectra are stored in multiple repositories: PeptideAtlas (Farrah et al. 2013), GPMDB (Craig and Beavis 2004), MassIVE (https://massive.ucsd.edu) and PRIDE (Vizcaíno et al. 2015), to cite the most renowned. The vast majority of these publicly available datasets are generated for humans and the main model organisms (S. cerevisiae, D. melanogaster, C. elegans, etc.). However, some dedicated resources have been created for farm animal proteomics, as detailed in Sect. 2.2.1. One obvious advantage of such raw data availability is the potential for reuse and reanalysis. Reanalysis can be done individually, but note that PeptideAtlas and GPMDB, for instance, routinely reprocess many datasets with in-house bioinformatics pipelines.

The second remarkable feature of these repositories is the option of defining a characteristic spectral library. Despite the existence of institutions such as the National Institute of Standards and Technology (NIST) where reference spectral libraries are collected, it may be of interest to constitute a specialised library. As mentioned in the introduction, the current shift from DDA to DIA approaches emphasises the need for quality spectral libraries.

It is important to stress that only minimal quality checks are undertaken, so that the resulting data resources contain spectra of uneven quality and, consequently, peptide and protein identifications of variable reliability.

2.1.2 Integrated Data

UniProt (www.uniprot.org) and protein data collected at NCBI (https://www.ncbi.nlm.nih.gov/protein) are the main sequence sources used in database search engines. UniProt however has a greater level of data integration and contains a wealth of information beyond sequence features including protein expression and structure. It is, as such, the most popular resource used for characterising proteins as finely as possible.

Recently proteogenomics has become a popular approach for merging information originating from genomics, transcriptomics and proteomics studies. Indeed, DNA sequence and RNA expression data accumulation over the past three decades provides rich sources to be combined with ever-increasing proteomics data. Proteogenomics currently significantly contributes to the identification of sequence variants, especially in humans, and the tools developed in this context could easily benefit farm animal proteomics studies. At the time of writing this chapter, no farm animal proteogenomics study has been published.

2.2 Farm Animal Proteomics Dedicated Databases

Farm animal dedicated databases are usually focused on genetic mapping information. Though apparently discontinued, ArkDB (http://www.thearkdb.org/arkdb) still hosts genome mapping data from farmed and other animal species, spanning genetic linkage and QTL (quantitative trait locus) data. A related website called ResSpecies (http://www.resspecies.org/resspecies/) gathers tools for displaying or exporting genotyping and phenotype data as well as population coverage and markers. Along the same lines, QTLdb [http://www.animalgenome.org/QTLdb (Hu et al. 2015)] collects publicly available trait mapping data for a smaller range of animal species and is associated with CorrDB, the animal trait correlation database (http://www.animalgenome.org/cgi-bin/CorrDB). To complement this abundant genetic information, some effort has been invested in developing proteome-centred resources, especially for capturing mass spectrometry data.

2.2.1 Farm Animal Mass Spectrometry Data

As mentioned in Sect. 2.1.1, mass spectrometry data repositories have recently imposed a paradigm shift in how published data are considered. PeptideAtlas pioneered the collection of datasets dedicated to farm animals, though admittedly these are not updated or enriched with new data as often as the human collection, as shown on the homepage where all newly included sets are listed.

The frequent use of mass spectrometry in studying milk and its constituents naturally led to the first initiative devoted to collecting MS data on bovine milk and the mammary gland (Bislev et al. 2012). The Equine PeptideAtlas (Bundgaard et al. 2014) and the Pig PeptideAtlas (Hesselager et al. 2016) followed, and the PeptideAtlas for the domestic chicken (McCord et al. 2017) is the most recent addition to the collection. These data can be queried from the generic interface (“Queries” tab), and corresponding data can be downloaded.

2.2.2 Farm Animal Integrated Data

The success of proteomic investigations largely depends on the availability of complete, annotated databases containing the gene and protein sequence information for the animal species of interest (Soares et al. 2012). Nonetheless, once proteins are identified by matching mass spectrometry data with sequence data according to the various strategies detailed in Sect. 3, stored and, in the best of cases, integrated knowledge of protein properties is very useful for rationalising the content of the studied sample. There are very few such resources for farm animal proteomics (FAP), and the following three cover the range:

  1. ProteINSIDE specifically supports the annotation of farm animal proteomics experiments (http://www.proteinside.org; Kaspric et al. 2015). It stores information on bovine, sheep and goat. An in-built workflow processes lists of proteins input by a user to extract information on protein function, subcellular location (secreted/cytoplasmic) and interacting partners.

  2. AgBase is an alternative resource for functional annotation, though not proteomics-oriented (http://www.agbase.msstate.edu; McCarthy et al. 2010). It covers more farm animal organisms.

  3. paxDB stores proteome-wide protein abundance information across organisms and tissues (http://pax-db.org; Wang et al. 2015b). Pig, bovine, horse and chicken are included.

3 Data Analysis Software

The use of two-dimensional gel electrophoresis (2DE) combined with MS has already allowed the characterisation of several distinct proteomes in different fields of animal science (Soares et al. 2012). However, the complexity of most proteomic samples challenges the separation power of such traditional techniques. More importantly, even with current improvements, 2DE remains a manual and time-demanding process. In proteomics studies, liquid chromatography (LC) has increasingly been used as a replacement for gel electrophoresis, since it can analyse large numbers of samples in a faster, automated and more repeatable fashion. This trend can be observed in Fig. 1.

Fig. 1 Estimated number of MS-based proteomics publications per year: gel electrophoresis (GE, red) vs. liquid chromatography (LC, blue). PubMed queries: “proteomics gel electrophoresis mass spectrometry” and “proteomics liquid chromatography mass spectrometry”
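For readers who wish to reproduce such counts, the R sketch below queries the NCBI E-utilities esearch endpoint with the query strings given in the caption; the year range is an arbitrary choice, and NCBI rate limits should be respected.

```r
# Count PubMed records per publication year for a query via NCBI E-utilities.
pubmed_count <- function(year, term) {
  url <- paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed",
    "&term=", URLencode(term, reserved = TRUE),
    "&rettype=count&datetype=pdat",
    "&mindate=", year, "&maxdate=", year)
  xml <- paste(readLines(url, warn = FALSE), collapse = "")
  Sys.sleep(0.4)                                  # stay under NCBI rate limits
  as.integer(sub(".*<Count>([0-9]+)</Count>.*", "\\1", xml))
}

years <- 2000:2016
lc <- sapply(years, pubmed_count,
             term = "proteomics liquid chromatography mass spectrometry")
ge <- sapply(years, pubmed_count,
             term = "proteomics gel electrophoresis mass spectrometry")
plot(years, lc, type = "l", col = "blue", ylab = "publications per year")
lines(years, ge, col = "red")
```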

As LC-MS is one of the core technologies routinely used in advanced proteomics research, this section focuses on software and computational methods for analysing LC-MS data. We describe the data and associated algorithms from the perspective of the acquisition methods by which the data are generated, because understanding the data generation process is often critical for successful data analysis and result interpretation.

Prior to LC separation, complex protein samples are digested into peptides. Since the chemistry of peptides is more uniform than that of proteins, tandem MS methods to sequence peptides are nowadays robust and mature (Yates III 2015). The most popular enzyme for protein digestion is trypsin, which yields peptides with a basic, protonatable residue (lysine or arginine) at the C-terminus, an advantage in subsequent MS-based peptide sequencing (Aebersold 2003). As peptides elute from the LC column, they are ionised by electrospray ionisation (ESI), and the resulting ions are analysed by the mass spectrometer.
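Since the cleavage rule is simple to state, an in silico digest fits in a few lines. The R sketch below implements the classical tryptic rule (cleave after K or R, but not before P); real search engines additionally model missed and nonspecific cleavages, and the example sequence and length cut-off here are arbitrary.

```r
# Minimal in silico tryptic digest: cleave after K or R unless followed by P.
tryptic_digest <- function(protein, min_length = 6) {
  residues <- strsplit(protein, "")[[1]]
  cut_after <- which(residues %in% c("K", "R"))
  next_res <- residues[cut_after + 1]             # NA at the protein C-terminus
  cut_after <- cut_after[is.na(next_res) | next_res != "P"]
  starts <- c(1, cut_after + 1)
  ends <- c(cut_after, length(residues))
  peptides <- substring(protein, starts, ends)
  unique(peptides[nchar(peptides) >= min_length])
}

tryptic_digest("MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")
```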

The MS analysis can be performed with different instruments and different operation modes. Broadly speaking, they can be classified as targeted and shotgun methods. The characteristics of the spectra and informatics approaches for these methods are discussed in the following subsections.

As illustrated in Fig. 2, we emphasise examples of chromatogram-based MS quantification workflows using the Skyline software tool (MacLean et al. 2010), developed and maintained by the MacCoss lab. Skyline is a versatile and robust Windows client that can be used to analyse all the types of MS data described here. A key feature of Skyline is that it extracts MS data directly from many instrument vendor formats, so conversion to open file formats is usually not required. Skyline is freely available and open source, offers an interactive and rather intuitive graphical user interface (GUI) for visualisation, and is documented by several tutorials. Results can be exported as text reports that researchers can further process using other tools.

Fig. 2 Chromatogram-based MS quantification workflows in proteomics. Raw MS data generated by LC-MS analysis have different complexities depending on the purpose of the study. Targeted MS methods such as SRM require more effort prior to the actual LC-MS analysis, but the data contain single chromatogram traces for each targeted peptide transition and therefore require less complex data processing. Shotgun MS methods acquire a full mass spectrum within each cycle. For DDA, MS1 data are used for quantification and MS2 data for identification. For DIA, the MS1 data are optional, and quantification is usually performed with the MS2 data, since the systematic fragmentation acquires continuous data for all fragments across the complete elution profile. The m/z range is typically divided into several precursor isolation windows, and thus DIA spectra contain several fragment ion maps. The spectral library provides information such as m/z and expected retention time used by Skyline as seeds for chromatogram extraction. The spectral library is not required for processing SRM data, but it might be used for designing the acquisition method, e.g. for selecting the best peptides and fragment ions. For DDA, the spectral library is used to extract the MS1 chromatograms for each identified peptide. For DIA, the spectral library is used to extract the MS2 chromatograms for the fragment ions of each identified peptide, from the spectra corresponding to the precursor isolation window. Statistical analysis to find differentially expressed proteins can be performed using MSstats, either as one of the external tools available within the Skyline GUI or independently using the exported results as input to the R MSstats package. Further downstream analysis can be performed with tools such as Cytoscape and DAVID

Moreover, several tools are available as plug-in modules within Skyline. A framework called “external tools” allows researchers to integrate their tools into Skyline without modifying the Skyline codebase (Broudy et al. 2014). With a uniform interface for installation into Skyline, the external tools can be easily accessed by all users for downstream statistical analyses.

3.1 Targeted MS Methods

Selected reaction monitoring (SRM)—also referred to as multiple reaction monitoring (MRM)—is a targeted MS technique whereby a predefined series of transitions (precursor/fragment ion pairs) are selected by the two mass filters of a triple quadrupole instrument and monitored over chromatographic elution for precise quantification (Lange et al. 2008; Picotti and Aebersold 2012).

Since SRM strictly targets a predetermined set of peptides, it is particularly useful when only a handful of proteins from a complex background, such as those constituting a cellular network or a set of candidate biomarkers, need to be measured across multiple samples in a consistent, reproducible and quantitatively precise manner.

A useful discussion of best practices for targeted MS measurements in biology and medicine, also applicable to animal science, can be found in Carr et al. (2014). The authors discuss the analytical goals and the experimental evidence needed to properly describe developed assays according to the required levels of performance, as well as the computational and statistical tools useful for the analysis of targeted MS data.

3.1.1 Design of Targeted Acquisition Methods

In SRM-based proteomics, a significant amount of time is required to design the acquisition method or assay, and several informatics tools are available to assist this process. For instance, tools to select proteotypic peptides, transitions and the best acquisition settings include SRMCollider (Röst et al. 2012), MRMOptimizer (Alghanem et al. 2017) and PREGO (Searle et al. 2015).

We highlight the PNNL Biodiversity Plugin as one of the external tools available in Skyline (Degan et al. 2016). The tool summarises available mass spectrometry data in a pathway-centric view and facilitates querying it from a biological perspective to design quantitative experiments. Selected proteins and their underlying mass spectra are imported into Skyline for further assay design (transition selection). The PNNL Biodiversity Library catalogues MS/MS spectra from over 3 million peptides and 230,000 proteins from 118 distinct organisms across the tree of life, all cross-referenced to KEGG pathways for intuitive biological interpretation.

To maximise the number of peptides that can be monitored in a single LC-MS analysis, scheduled SRM methods can be designed (Stahl-Zeng et al. 2007). Using the expected peptide elution times in the target list, computer programs automatically generate SRM acquisition methods in which the transitions of a specific peptide are only targeted during a time window around its elution time. In this way, the number of peptides measured in an LC-MS run is increased without compromising the limit of detection or the quantitative accuracy.
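The scheduling logic itself can be sketched in a few lines of R: each peptide’s transitions are monitored only within a window around its expected retention time, and the scheduler can check the peak number of concurrent transitions that the instrument cycle time must accommodate. All peptide names, times and the window width below are hypothetical.

```r
# Scheduled SRM sketch: assign monitoring windows and check concurrent load.
targets <- data.frame(
  peptide       = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC"),
  rt_expected   = c(12.4, 13.1, 25.8),   # minutes, from prior DDA/SRM runs
  n_transitions = c(3, 3, 3))
window <- 2.5                            # +/- minutes around the expected RT

targets$rt_start <- targets$rt_expected - window
targets$rt_end   <- targets$rt_expected + window

# Maximum number of transitions monitored simultaneously over the gradient
concurrent <- function(t) sum(targets$n_transitions[targets$rt_start <= t &
                                                    targets$rt_end   >= t])
max(sapply(seq(0, 40, by = 0.1), concurrent))
```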

3.1.2 Data Processing for SRM Quantification

Unlike other MS-based proteomic techniques, SRM records no full mass spectra. SRM data consist of a set of chromatographic traces with the retention time and signal intensity for each of the monitored transitions. This non-scanning nature translates into a sensitivity increase of one to two orders of magnitude compared with conventional “full-scan” techniques. The two levels of quadrupole mass selection with narrow mass windows result in high selectivity, as co-eluting background ions are filtered out very effectively.

Integration of the chromatographic peaks for each transition supports relative quantification or, if suitable heavy isotope-labelled reference standards are used, absolute quantification of the targeted peptides, which serve as surrogate measures of the proteins of interest. Isotope labelling increases the complexity and cost of an experiment, with the benefit of more precise quantification.
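As a minimal illustration of this step, the R sketch below integrates a transition trace with the trapezoidal rule and computes the light/heavy area ratio against a spiked, heavy-labelled internal standard; both traces are simulated.

```r
# Trapezoidal integration of a chromatographic peak (one transition trace).
peak_area <- function(rt, intensity) {
  sum(diff(rt) * (head(intensity, -1) + tail(intensity, -1)) / 2)
}

rt    <- seq(14.0, 14.5, by = 0.05)                        # minutes (11 points)
light <- c(0, 2, 10, 40, 90, 100, 75, 35, 12, 3, 0) * 1e3  # endogenous peptide
heavy <- c(0, 1, 6, 25, 55, 60, 45, 20, 7, 2, 0) * 1e3     # labelled standard
peak_area(rt, light) / peak_area(rt, heavy)  # ratio -> absolute amount, given
                                             # the known spiked quantity
```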

Other targeted methods, in which high-resolution full MS/MS spectra are acquired for each target peptide, such as parallel reaction monitoring (PRM) (Peterson et al. 2012), generate data for all detectable fragment ions. In PRM, the third quadrupole of a triple quadrupole is substituted with a high-resolution, accurate-mass analyser.

3.2 Shotgun MS Methods

In contrast to targeted MS methods, the so-called shotgun or bottom-up approach does not require predefined information about the proteins of interest or the analytes in the sample. The MS instrument is operated to record the spectra of as many peptides as possible as they elute from the LC column and are ionised by electrospray.

3.2.1 Data-Dependent Acquisition and MS1 Quantification

Within each DDA cycle, ion signals are recorded in an MS1 or survey scan (precursor ion signals), and the top-N most abundant ions are then selected and serially isolated for fragmentation (MS/MS, MS2 or tandem MS) to generate structural information. Since fragmentation models are well characterised for amino acid sequences, theoretical spectra can be generated according to factors including the peptide sequence and the type of fragmentation, typically collision-induced dissociation (CID). Most of the MS/MS data are highly pure (each spectrum contains fragments from mainly one peptide) and can be annotated with peptide sequences using search tools such as Mascot (Perkins et al. 1999), SEQUEST (Eng et al. 1994), Andromeda (Cox et al. 2011), X!Tandem (Craig and Beavis 2004) and MS-GF+ (Kim and Pevzner 2014). A search tool digests the protein sequences into peptides in silico and generates theoretical spectra against which the observed tandem mass spectra are scored (Nesvizhskii 2010).

While search tools produce a match for almost every input MS/MS spectrum, only a fraction of those peptide-to-spectrum matches (PSMs) are true. The most commonly used and accepted statistical confidence measure is the false discovery rate (FDR) (Benjamini and Hochberg 1995), adapted to proteomics (Elias and Gygi 2007) as the “target-decoy approach” (TDA), a summary statistic for the entire collection of PSMs. A decoy database is generated by reversing or shuffling the amino acids of the sequences in the reference database and is included in the search to estimate the FDR as the expected proportion of incorrect PSMs among all accepted PSMs (Nesvizhskii 2010). FDR analysis is typically included within the identification software; however, it can also be performed and refined using other tools such as MAYU (Reiter et al. 2009), Percolator (Käll et al. 2007; The M et al. 2016), iProphet (Shteynberg et al. 2011), PeptideProphet (Keller et al. 2002; Choi and Nesvizhskii 2007) and ProteinProphet (Nesvizhskii et al. 2003).
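The core target-decoy computation is compact. The R sketch below estimates, on simulated PSM scores, the score threshold at which the estimated FDR (decoys divided by targets above the threshold) remains below 1%; production tools add refinements, such as q-value monotonisation, that are omitted here.

```r
# Target-decoy FDR: decoy PSMs above a score threshold estimate the number of
# incorrect target PSMs above that threshold.
psm_score_at_fdr <- function(scores, is_decoy, fdr = 0.01) {
  ord <- order(scores, decreasing = TRUE)
  est <- cumsum(is_decoy[ord]) / pmax(cumsum(!is_decoy[ord]), 1)
  keep <- which(est <= fdr)
  if (length(keep) == 0) return(NA)
  scores[ord][max(keep)]       # lowest score still within the FDR bound
}

set.seed(1)
target_scores <- c(rnorm(600, mean = 4), rnorm(400, mean = 0))  # correct + incorrect
decoy_scores  <- rnorm(1000, mean = 0)
psm_score_at_fdr(c(target_scores, decoy_scores),
                 is_decoy = rep(c(FALSE, TRUE), c(1000, 1000)))
```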

Properly estimating and controlling the FDR are essential steps in the computational pipeline to prevent subtle but profound errors in high-throughput science. Less emphasis should be placed on the number of identifications achieved; instead, the work should be valued as a whole (Serang and Käll 2015).

After FDR analysis, confidently identified peptides can be used to build a library, which can be imported into Skyline to perform label-free relative quantification. Skyline extracts the precursor ion signals of each peptide from the MS1 raw data and computes the area under the peak or ion chromatogram of each peptide elution profile (Schilling et al. 2012).
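A minimal sketch of the extraction itself, assuming centroided MS1 scans and a fixed ppm tolerance, is given below in R; the toy scans are hypothetical, and real implementations additionally handle retention time alignment, isotope envelopes and charge states.

```r
# MS1 XIC sketch: per scan, sum the intensity within a ppm window of the
# precursor m/z; the resulting trace is then integrated over the elution peak.
extract_xic <- function(scans, mz_target, ppm = 10) {
  tol <- mz_target * ppm * 1e-6
  sapply(scans, function(s) sum(s$intensity[abs(s$mz - mz_target) <= tol]))
}

scans <- list(  # hypothetical centroided MS1 scans in retention time order
  list(mz = c(421.7600, 500.28), intensity = c(1e4, 2.0e3)),
  list(mz = c(421.7601, 612.33), intensity = c(5e4, 1.0e3)),
  list(mz = c(421.7590, 500.28), intensity = c(3e4, 2.5e3)))
extract_xic(scans, mz_target = 421.76)   # precursor intensity trace
```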

Another popular method traditionally used for label-free quantification in shotgun proteomics is spectral counting, based on the number of MS/MS spectra identified for each peptide sequence (Liu et al. 2004). Despite providing a rapid, semi-quantitative measure of abundance, spectral count-based quantification is affected by sample complexity; it has often been found to give irreproducible results and to be unsuitable for quantifying low-abundance proteins (Cappadona et al. 2012; Ahrné et al. 2013). In contrast, chromatogram-based MS quantification, based on integration of the peptide elution profile, provides a level of accuracy comparable to labelling approaches.

3.2.2 Data-Independent Acquisition and MS2 Quantification

With recent developments in MS instrumentation, the application of alternative MS operation modes such as DIA has become feasible (Chapman et al. 2014; Bilbao et al. 2015a). In contrast to DDA, and by systematically parallelising fragmentation, DIA avoids the selection of individual peptide ions during LC-MS analysis, thereby providing several advantages for characterising complex protein digests.

In a single injection or LC-MS analysis, DIA generates a comprehensive and permanent digital record of the sample (Liu et al. 2013). Because of the systematic sampling process, there is no need to reinject the sample for LC-MS analysis, as opposed to DDA and SRM. Acquired once and mineable forever, DIA spectra can be used to test new hypotheses or be reprocessed when a better-quality library becomes available (e.g. a new genome) or when an updated or new processing software tool is released.

At the same time, in DIA, convoluted or multiplexed MS/MS spectra are generated without explicit association between each single precursor and its corresponding fragments. As a result, DDA search engines are not appropriate for processing DIA spectra, and several informatics tools have been developed recently to effectively process these complex datasets for identification: DIA-Umpire (Tsou et al. 2015, 2016), Group-DIA (Li et al. 2015) and MSPLIT-DIA (Wang et al. 2015a).

Here we focus on the targeted data extraction strategy related to the SWATH methodology (Gillet et al. 2012), which requires a list of peptide transitions (also called an assay) built from previous DDA/SRM libraries. These libraries can either be collected from public repositories such as PeptideAtlas (www.peptideatlas.org), SRMAtlas (www.srmatlas.org) and SWATHAtlas (www.swathatlas.org) or be generated by analysing the studied sample in DDA mode as well (for a detailed protocol to generate high-quality reference libraries, see Schubert et al. (2015a)). A multiplexed variant of the SWATH methodology, termed MSX (Egertson et al. 2013, 2015), is also implemented within the Skyline software, supporting both acquisition method design and data processing for quantification.

As for MS1 quantification, Skyline extracts the ion signals of each peptide, but in this case fragment ion signals are extracted from the MS2 raw DIA files. Quantification is therefore performed using the elution profiles of the peptide fragments, as for SRM, with the area under the peak or ion chromatogram as the abundance measure. Fragment ion abundances are subsequently aggregated into the corresponding peptides and proteins. A statistical measure of detection confidence is also computed for each peptide, using a version of the mProphet algorithm (Reiter et al. 2011) implemented within Skyline.
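The mapping from a library precursor to the right fragment ion map can be sketched as follows, assuming a uniform 25 m/z SWATH-like window scheme (an illustrative choice; real schemes are often variable-width). Fragment chromatograms would then be extracted from the MS2 spectra of the returned window, with the same logic as the MS1 extraction sketch above.

```r
# DIA targeted extraction sketch: find the precursor isolation window whose
# MS2 spectra contain the fragments of a library peptide.
windows <- data.frame(low  = seq(400, 1175, by = 25),   # 32 windows, 400-1200 m/z
                      high = seq(425, 1200, by = 25))

window_for_precursor <- function(mz) which(windows$low <= mz & mz < windows$high)

window_for_precursor(563.3)   # index of the fragment ion map to read XICs from
```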

Other software tools for targeted data extraction are also available: Spectronaut (Bernhardt et al. 2012), a Biognosys proprietary software with a free licence for academics, and OpenSWATH (Röst et al. 2014), an open-source standalone tool also integrated as a module into the OpenMS proteomics framework (Sturm et al. 2008; Röst et al. 2016), as well as tools without a GUI such as DIANA (Teleman et al. 2014) and SWATHProphet (Keller et al. 2015).

Recently, the performance of several of these tools was compared (Navarro et al. 2016). The authors observed similarly reliable performance after software and parameter optimisation and concluded that targeted data extraction is a valid alternative to isotope-labelling-based methods.

Another consideration in DIA data processing is that the concurrent fragmentation of peptides increases the likelihood of interference due to the overlap of fragment ions from different precursors. Several computational strategies can tackle this issue to further expand the benefits of DIA (Zhang et al. 2015; Bilbao et al. 2015b, 2016).

3.3 Statistical Analysis of Quantitative Results

Based on the quantification data, the next step is to determine candidate proteins showing significant differences across several sample types or conditions. The R statistical package MSstats (Choi et al. 2014) can be used standalone or as one of the external tools available within Skyline, and it handles SRM, DDA and DIA quantification results.
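A hedged R sketch of this route is shown below; dataProcess() and groupComparison() are the documented MSstats entry points, but the file name and group labels are hypothetical, the input must follow the MSstats column conventions (e.g. a Skyline MSstats report), and result field names may vary between package versions.

```r
# Hedged MSstats sketch; consult the package documentation for exact formats.
library(MSstats)

raw   <- read.csv("skyline_msstats_report.csv")  # hypothetical Skyline export
quant <- dataProcess(raw)                        # normalisation + summarisation

# One contrast, treated vs. control; columns must match the condition labels.
comparison <- matrix(c(-1, 1), nrow = 1,
                     dimnames = list("treated-control", c("control", "treated")))
fit <- groupComparison(contrast.matrix = comparison, data = quant)
head(fit$ComparisonResult)    # log fold changes with (adjusted) p-values
```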

The external QuaSAR tools (Abbatiello et al. 2010; Mani et al. 2012) automate and assist the quantification of stable isotope dilution experiments. QuaSAR produces, for every peptide, tabulated results for essential statistics such as the coefficient of variation, regression slope and intercept (with confidence intervals) and the limits of detection and quantification, as well as figures summarising their distribution and variation.

3.4 Automated PTM Detection

It is now well established that posttranslational modifications (PTMs) act on proteins, in isolation or in combination, for modulation and regulation purposes. In recent years, this field of investigation has driven intense bioinformatics development.

3.4.1 Main PTM Bioinformatics Resources

Two main databases are considered references for storing PTM-related information. UniMod (http://www.unimod.org) is a comprehensive list of protein modifications for mass spectrometry applications. dbPTM (http://dbPTM.mbc.nctu.edu.tw) describes the substrate specificity of PTM sites and provides functional annotation of PTM-related substrates and known interacting proteins. Neither specifically distinguishes between species, as both aim at broad coverage across their respective statistical tables. With this concern for broad coverage, dbPTM also maintains a comprehensive list of databases and prediction software dedicated to individual or groups of PTMs. To complement the summary information associated with each resource, useful and more detailed comments can be found, for instance, in Kamath et al. (2011).

Over the past decades, the accumulation of sequence data has led to the implementation of PTM site prediction software based on amino acid patterns in aligned sequences. Each method was usually designed to identify an individual PTM. Many of these methods were developed with web interfaces and are hosted on the ExPASy (www.expasy.org/proteomics/post-translational_modification) and CBS (www.cbs.dtu.dk/databases/PTMpredictions) servers, whose creators pioneered this field. More recently, the accumulation of mass spectrometry data supporting PTM detection has contributed to refining prediction reliability on the basis of more comprehensive experimental data. This is, for instance, the case for phosphorylation sites through resources such as PhosphoSite (http://www.phosphosite.org), which collects published mass spectrometry data for site annotation. In fact, understanding PTM occurrence goes hand in hand with studying the corresponding modifying enzyme(s). In many cases, these enzymes are not known, or their targets are not precisely defined. For phosphorylation, KinBase (http://kinase.com/kinbase/) is the kinome reference, and the combined use of MS-validated sites and kinase specificity helps refine site prediction, as further explained in Sect. 3.4.2.
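The earliest pattern-based predictors essentially scanned sequences for motifs. As a toy illustration, the R sketch below locates candidate N-glycosylation sequons (N-X-S/T, where X is not proline) in a hypothetical sequence; real predictors train statistical models that go well beyond such naive pattern matching.

```r
# Naive motif scan for N-glycosylation sequons N-X-S/T (X != P).
find_sequons <- function(protein) {
  hits <- gregexpr("N[^P][ST]", protein, perl = TRUE)[[1]]
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

find_sequons("MKNGSAAPNPTQWLNVSK")   # positions of candidate glycosites (3, 15)
```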

Although not fully considered a PTM, protein cleavage should nonetheless be part of the modification landscape, and proteases have long been collected and classified in the MEROPS database (http://merops.sanger.ac.uk). The connection between proteolysis and PTMs is brought out in TopFIND, the N-/C-terminal modification database (http://clipserve.clip.ubc.ca/topfind). With a strong focus on human and mouse data, TopFIND also attempts to merge protein cleavage information with protein-protein interactions via the PathFinder tool (Fortelny et al. 2014), by mapping and modelling a protease interaction network.

Nonetheless, at present, besides published articles, UniProt remains the main source compiling information on alternative effectors of PTMs. In fact, examples of protein-protein interaction networks integrating PTM knowledge (occurrence plus specific effector) are rare. They tend to be devoted to mapping data collected in eukaryotes, such as the yeast methylome (Erce et al. 2012) or the phospho-tyrosine interaction network in human (Grossmann et al. 2015).

3.4.2 Predicting PTM and Their Associated Enzymes

The association between a phosphorylated site and the kinase that actually performed the attachment on a serine, threonine or tyrosine residue is far from obvious to predict, despite the clear need for a fuller picture of phosphorylation. So far, the best-known tool combining several sources of data to suggest site-enzyme associations for phosphorylation is NetworKIN (http://networkin.info). The method first uses a predictor to label a given phosphosite sequence with a kinase or kinase family. This predictor is trained with experimental data to ascertain the relationship between a site and a kinase. For instance, Scansite (http://scansite3.mit.edu) mostly relies on peptide library screening, phage display and mass spectrometry experiments to gather enough examples of labelled sites and identify sequence patterns characteristic of a site in definite association with a known kinase. Nonetheless, a high level of ambiguity persists; as a second step, to narrow down the options, NetworKIN includes contextual information by extracting knowledge of protein-protein interactions centred on the kinases of interest from an interaction database. By calculating the proximity of the substrate to all kinases in a network of functional relationships, NetworKIN infers the most likely candidate kinase for each site.

The knowledge of phosphorylation is far more advanced than that of other common PTMs such as glycosylation. Despite the possible mapping between a glycan structure and the set of enzymes required for its synthesis, the characterisation of intact glycopeptides remains a definite challenge. As it is, most glycan structures have been solved after being cleaved off their natural support, while protein glycosylation sites are identified after removing the attached glycans. In the end, key information on the glycoconjugate is lost. The correlation between glycan structures and glycoproteins can be restored manually through literature searches that are both labour- and time-consuming; this is, for example, the purpose of UniCarbKB (www.unicarbkb.org). In this context, prediction tools linking a glycosite with the appropriate glycosyltransferases may appear in a not too distant future. Information on these enzymes has accumulated in the CAZy database (www.cazy.org) over the past two decades.

3.4.3 PTM Discovery

Mass spectrometry is the method of choice for detecting PTMs. Since the early days of software development for analysing mass spectra, bioinformaticians have shared the concern of identifying possible mass shifts corresponding to the addition or removal of chemical groups of known masses. It then appeared that, conversely, the occurrence of regular and identical mass shifts of unknown origin in MS2 data could be a source of new knowledge, and a range of tools was developed to perform so-called open modification searches, that is, PTM searches with no a priori assumptions (Ahrné et al. 2010). This approach was scaled up to process high-throughput proteome data with the prospect of discovering unexpected modifications (Na et al. 2011; Horlacher et al. 2015). However, data interpretation remains a challenge, and findings require experimental validation. Nonetheless, large-scale processing supports scientists in investigating and discovering new leads.
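The underlying idea can be illustrated simply: collect the precursor mass differences between observed spectra and their best-matching unmodified peptides, and look for recurring shifts. In the simulated R sketch below, deltas near +79.97 Da (phosphorylation) and +15.99 Da (oxidation) emerge as frequent values; real open searches operate at a far larger scale and must control error rates carefully.

```r
# Open modification search sketch: recurring precursor mass shifts between
# observed and theoretical peptide masses hint at unanticipated PTMs.
set.seed(2)
deltas <- c(rnorm(300, 0.000, 0.005),   # unmodified matches (mass error only)
            rnorm(60, 79.966, 0.005),   # phosphorylation-like shift
            rnorm(40, 15.995, 0.005))   # oxidation-like shift

head(sort(table(round(deltas, 2)), decreasing = TRUE), 4)  # most frequent shifts
```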

It is worth noting that the top-down proteomics approach is very promising for generating mass information on PTMs (Smith et al. 2013). The ProSight software commonly run to analyse this particular type of data is adapted to the identification of a broad range of PTMs (Fellers et al. 2015).

Recently, the potential of label-free quantification by PRM was demonstrated for targeted phosphoproteome analysis (Lawrence et al. 2016). The authors also created a web-based assay development application that queries the database for optimal peptide selection and retention time scheduling (phosphopedia.gs.washington.edu).

3.4.4 PTM Combination

The next challenge is to identify the constraints that rule cooperative and/or antagonistic PTM effects, as pointed out in Venne et al. (2014). PTMs can be mutually exclusive, such as the phosphorylation and O-GlcNAcylation of serine and threonine in signalling proteins, presented very early on as the “yin-yang hypothesis” (Hart et al. 1995), but they can also be cooperative, as in the well-studied case of histones (Schwämmle et al. 2014). Data have accumulated on a few other proteins such as tubulins (Verhey and Gaertig 2007), the FoxO regulator (Calnan and Brunet 2008) and chaperones (Cloutier and Coulombe 2013). However, the critical mass of experimental data needed to design bioinformatics tools supporting the discovery of PTM co-occurrence rules, which could potentially explain corresponding effects on protein function, is still missing. In the meantime, simple co-occurrence is collected, for instance, in the PTMCode database (Minguez et al. 2014), which shows potential trends. This field is likely to blossom in the years to come.

3.5 Interactomics and Data Interpretation

A list of identified proteins is not sufficient to characterise a sample. The challenge is to understand why some proteins are co- or differentially expressed and which underlying processes explain cooperative or concurrent activity.

Detailed information about how MS-based proteomics has been applied to network biology, to detect and quantify perturbation-induced network changes and to correlate network dynamics with cellular phenotypes can be found in Bensimon et al. (2012).

The most common approach is to extract protein information from the Gene Ontology (GO), which assigns terms from controlled vocabularies specifying a protein’s subcellular location, its function and its contribution to one or more processes (see www.geneontology.org). When proteins share terms characterising a location, function or process, this type of similarity helps in shaping a hypothesis. Alternatively, knowledge of protein-protein interactions stored in databases such as IntAct (http://www.ebi.ac.uk/intact/) or BioGRID (http://www.biogrid.org) can be used to build an interaction network. Note that readily available interaction networks can be visualised and queried in STRING (http://string-db.org). In all cases, tools integrated in the cited data resources support protein data interpretation. One of the most popular open-source tools for the integration, visualisation and analysis of biological networks is Cytoscape (Shannon et al. 2003). This popularity has been driven by the ability to extend Cytoscape functionality through plugins (Saito et al. 2012), yielding a powerful and heterogeneous set of tools and enabling a broad community of scientists to contribute. This part is illustrated further with concrete examples in the following section.
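The statistical core of such term enrichment, as performed by tools like DAVID, is a one-sided hypergeometric test, sketched below in R with illustrative counts.

```r
# Is a GO term over-represented in an identified protein list?
# k annotated proteins in a list of n, K annotated in a background of N.
go_enrichment_p <- function(k, n, K, N) {
  phyper(k - 1, K, N - K, n, lower.tail = FALSE)   # P(X >= k)
}

go_enrichment_p(k = 12, n = 50, K = 200, N = 10000)  # p-value for one term
# Across many terms, correct for multiple testing, e.g. p.adjust(p, "BH").
```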

4 Applications

There are roughly three types of applications in which high-throughput proteomics methods are used and bioinformatics is therefore necessary. In most instances detailed below, resources tailored for the study of model organisms, mainly human, mouse, Drosophila and yeast, show some limitations in supporting the interpretation of experimental results.

4.1 Animal Health

In much the same way as in medical studies, one goal of using proteomics is to identify reliable disease biomarkers for diagnosis or prognosis. For example, serum protein profiles have been used to detect infectious disease in pigs (Koene et al. 2012). However, genomics approaches remain prevalent, as illustrated in a recent comprehensive review of bioinformatics tools available to study parasites of veterinary significance (Cantacessi et al. 2012). Omics data integration is still a prospect at this point in time.

As highlighted in Sect. 3.5, interactomics provides the most attractive ground for the interpretation of proteomics data. The identification of protein complexes in a sample is a first step towards rationalising and understanding protein co-expression. For example, host cellular proteins forming complexes with specific proteins of the porcine reproductive and respiratory syndrome virus (PRRSV) essential to virus replication were identified by pull-down experiments in Dong et al. (2016). Binding partners of viral proteins were then systematically mapped onto known pathways with Ingenuity Pathway Analysis (https://www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis/) to shortlist the pathways involving the most highly expressed partners. This study, along with further experimental work, led to the identification and validation of the HSP70 chaperone as key to PRRSV transcription and replication. More generally, the investigation of host-pathogen interactions is obviously suited to proteomics-based approaches, and the wealth of information stored in bioinformatics databases provides useful support. Proteomics is increasingly introduced into veterinary medicine (Ceciliani et al. 2014), and data accumulation will soon popularise the use of bioinformatics for information-rich comparative studies.

Furthermore, Bundgaard et al. (2016) applied SRM to investigate the levels of eight inflammatory acute phase proteins in interstitial fluid from wounds in horses. Protein-specific peptides were selected using the Equine PeptideAtlas website.

Packialakshmi et al. (2016) investigated proteomic differences in the plasma of healthy and femoral head necrosis-affected chickens using shotgun MS methods. MS/MS data were converted from the proprietary format to mzXML files using the instrument vendor software and submitted to the Global Proteome Machine (GPM; http://www.thegpm.org) for identification with X!Tandem.

Proteins identified with at least one unique peptide at 5% FDR were considered true identifications. The results were downloaded as XML files for the Skyline software. After MS1 signal extraction, label-free quantitation was performed using MSstats as one of the external tools directly available in Skyline; its group comparison function was used for the label-free quantitation and to generate a volcano plot showing the differentially expressed proteins.

The list of proteins was mapped to the corresponding Ensembl gene IDs using BioMart and analysed for relative enrichment, clustering and GO annotations using DAVID (Huang et al. 2009a, b).

4.2 Adipose Tissue and Muscle Studies

The lack of farm animal dedicated proteomics resources motivated the development of the ProteINSIDE database cited in Sect. 2.2.2. It was used in a study of bovine adipogenesis and myogenesis, as well as the balance between these two processes (Kaspric et al. 2015). Previously analysed mass spectrometry data of foetal bovine adipose (Taga et al. 2012) and muscle (Chaze et al. 2009) tissues had identified proteins that were poorly annotated. The data and tool integration of ProteINSIDE supported the interpretation of these protein lists. In particular, protein-protein interactions, collected from the various sources cited in Sect. 3.5, were the main piece of information used for identifying clusters of functionally interconnected proteins. In muscle, the clusters were associated with four processes (muscle development, cell proliferation, energetic complex and respiratory chain) and, in adipose tissue, with seven (cell proliferation, proteasome complex, complexes I and III of the respiratory chain, redox activity, and differentiation and metabolism of adipose tissue). The overlap between the two tissue types suggested possible crosstalk mechanisms.

Other authors have used bioinformatics sequence analysis tools to identify proteotypic peptides in an attempt to define biomarkers of meat authenticity following a targeted proteomics approach (Orduna et al. 2015). In the same vein, Stella and co-workers used SRM to quantify 12 potential protein markers of skeletal muscle and to detect anabolic treatment with dexamethasone (Stella et al. 2016). The listed proteins were markers identified in a previous study applying a two-dimensional difference gel electrophoresis approach. A scheduled SRM method was developed using the Skyline software to monitor 24 signature peptides from the 12 considered protein markers (two peptides per protein), with three precursor-to-product ion transitions targeted for each peptide. Quantification relied on a dedicated spiked-in internal standard for each target: 13C/15N isotopically labelled peptides sharing the same sequence but carrying a defined mass shift. Peptide quantification was achieved in Skyline by integrating the area of the chromatographic peak of each peptide and of the corresponding labelled internal standard.

Using R, protein abundances were described graphically with box plots, and potential differences in protein concentration among animal groups were explored by one-way analysis of variance (ANOVA) on the two animal sets.
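A base R sketch of that kind of analysis, with simulated abundances and hypothetical group labels, could look as follows.

```r
# Box plots per animal group and a one-way ANOVA for one protein marker.
df <- data.frame(
  group     = rep(c("control", "treated"), each = 10),
  abundance = c(rnorm(10, mean = 100, sd = 8), rnorm(10, mean = 120, sd = 8)))

boxplot(abundance ~ group, data = df, ylab = "protein abundance")
summary(aov(abundance ~ group, data = df))   # one-way ANOVA table with p-value
```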

4.3 Milk Proteome and Glycoproteome

As mentioned in Sect. 2.2.1, the analysis of the protein content of farm animal milk has long been a focus of veterinary and biological science. It is not surprising that several groups have already undertaken global studies, such as a detailed pathway mapping of 106 human milk proteins (D’Alessandro et al. 2010). In this in silico study, pathway analysis software based on knowledge of metabolism was used to identify cell proliferation and differentiation pathways on top of the usual nutrition and immune functions known to characterise milk proteins. This provides evidence of the tissue growth and organ development capacities of milk proteins, based on collecting public data and using pathway analysis tools.

To account for quantitative aspects, mass spectrometry and bioinformatics tools were used in two recent studies (Tacoma et al. 2016; Zhang et al. 2016). While the former conventionally relies on Gene Ontology to track functional features of differentially expressed proteins in two dairy cow breeds, the latter takes the comparison of the human and bovine milk proteomes over lactation further. In this study, MS data were first differentially quantified using the MaxQuant software, and interactions between the most co-expressed proteins were then derived from STRING (see Sect. 3.5). The results reveal the interconnected roles of milk proteins in nutrition and in protecting the neonate.

Finally, the importance of glycosylation in milk needs to be acknowledged, and quantitative studies are being undertaken here too in a systematic way. For example, Huang et al. (2016) used SRM to quantify seven human milk proteins and their glycoforms.

5 Conclusion

The application of proteomics in veterinary studies has been moving from initial qualitative description towards quantification, where experiments identifying and quantifying protein changes in different samples of particular tissues or fluids are becoming more common (Ceciliani et al. 2014). The application of untargeted label-free quantification methods is increasing rapidly in the proteomics field in general, fuelled by the advantages of DIA methods and by the development of software tools and innovative algorithms, implementations of which have already been shown to estimate absolute cellular protein concentrations (Schubert et al. 2015b). In this context, further improvements and new computational strategies for quantification are expected.

Numerous studies have been published to date in domestic and farm animal proteomics. Among several challenges and limitations, we highlight the lack of detailed information on MS-based proteome informatics tools in farm animal reviews and resources: only general notions are described, not specific tools, and with very few references.

The previously described difficulty of using MS proteomics software (Cappadona et al. 2012), associated with the lack of appropriate documentation or with poor graphical user interfaces, is still an issue. Fortunately, this is changing with tools like Skyline, an active and continuously expanding project; for instance, it now supports chromatogram extraction from MS1 and MS2 spectra with the additional ion mobility separation recently made available in commercial instruments (Baker et al. 2015). An increasing number of tools within open-source and collaborative projects can be expected, and we encourage the development of more robust MS tools that can be used by researchers not specialised in mass spectrometry or informatics.