
4.1 Introduction

Unraveling protein biosynthesis is a multi-omics integration endeavor. From the DNA template (genomics), a region is transcribed (transcriptomics) and subsequently translated (translatomics) into protein products (proteomics). These omics fields intertwine, yet each is complex enough to be considered a discipline in its own right. Integration of matching multi-omics datasets, although challenging, can lead to more sound results and even new insights. Advances in bioinformatics have facilitated this multi-omics integration, and expert tools have become available to tackle specific parts of proteogenomics analyses (e.g., PROTEOFORMER (Crappé et al. 2014a), PEPTIDESHAKER (Vaudel et al. 2015b); see also Menschaert and Fenyö (2015) for a review of bioinformatics tools available in the proteogenomics field). One intriguing field empowered by multi-omics aims to identify novel protein-coding sequences. Direct assessment of proteins through mass spectrometry-based proteomics analysis, combined with genomics, transcriptomics and translatomics information, provides the necessary means to unravel the information flow from DNA to proteins (Wang and Zhang 2014). In particular, this book chapter discusses the identification of micropeptides (translation products of small open reading frames) and neo-antigens (peptides resulting from protein variants that can be recognized by the immune system). First, we briefly describe MS-based proteomics technology, highlighting the necessity for multi-omics integration in the research fields mentioned above.

As mentioned, the preferred methodology for protein/peptide identification is mass spectrometry (MS), a technique with high sensitivity and specificity (Cheng et al. 2014; Ryu 2014), capable of detecting up to 10,000 proteins from a single sample (Nagaraj et al. 2011). The global MS workflow consists of enzymatic digestion of proteins extracted from the sample into peptides, which are subsequently fragmented and analyzed by a mass spectrometer. The instrument registers the mass-to-charge ratio of the ionized peptide fragments, yielding peptide fragmentation spectra. Peptides are identified through database search engines (e.g., X!Tandem (Craig and Beavis 2004), MyriMatch (Tabb et al. 2007), MS-GF+ (Kim and Pevzner 2014; Granholm et al. 2014), Comet (Eng et al. 2015), MS Amanda (Dorfer et al. 2014)). A peptide-spectrum match (PSM) score is calculated by comparing experimental spectra against theoretical spectra, generated after in silico digestion of all proteins provided in a sequence database. Statistical validation methods in MS-based proteomics compute the false discovery rate (FDR) by means of a target-decoy approach, assuming that the reference database contains the “true” pool of sequences represented in the sample (Hernandez et al. 2014). Deviation from this assumption impairs validation. The guiding principle is therefore not to use the most exhaustive reference database, but instead to use the reference database that best represents the true nature of the biological sample (Gupta et al. 2011; Nesvizhskii 2010; Wang et al. 2009a; Keller et al. 2002). Small proteins (micropeptides) yield fewer proteolytic peptides and are often absent from reference protein databases, which complicates their MS identification. Also, distinguishing highly similar peptides can be difficult, as is frequently the case in neo-antigen identification.
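The in silico digestion and target-decoy steps described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular search engine: the cleavage rule (after K or R, not before P), the reversed-sequence decoys, the simple decoys/targets FDR estimate, and all function names are assumptions made for the example.

```python
import re


def tryptic_digest(protein, missed_cleavages=0, min_len=6):
    """In silico digestion: cleave after K or R unless the next residue is P
    (an assumed, commonly used trypsin rule)."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = set()
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` neighbouring pieces.
        for j in range(i, min(i + missed_cleavages + 1, len(pieces))):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_len:
                peptides.add(peptide)
    return peptides


def reverse_decoy(protein):
    """Decoy sequence obtained by reversing the target (one common choice)."""
    return protein[::-1]


def fdr_at_threshold(psms, threshold):
    """Estimate the FDR as (#decoy PSMs) / (#target PSMs) at or above a score.

    `psms` is an iterable of (score, is_decoy) pairs; higher scores are better.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    print(sorted(tryptic_digest("MKTAYIAKQRPLLK", min_len=1)))
    hypothetical_psms = [(10, False), (9, False), (8, True), (7, False), (6, False)]
    print(fdr_at_threshold(hypothetical_psms, 6))
```

The estimate is only meaningful if the target database resembles the sequences truly present in the sample, which is the assumption discussed above.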

Search engines and algorithms influence the peptide identification rate, but reference database construction is pivotal, as inclusion in the database is a prerequisite for identification. UniProtKB (EMBL et al. 2013; Apweiler et al. 2014) is most commonly used as the reference database in the MS-based proteomics identification process. This database is incomplete, as it (partly) lacks information on novel proteoforms (isoforms), single nucleotide variations (SNVs), indels (insertions and deletions), and gene fusion products. A more inclusive reference database for novel protein identification can be constructed from all ORFs obtained by translating the genome in its six reading frames. This strategy includes all possible protein forms except peptides spanning exon junctions. Six-frame translation databases are therefore widely used for prokaryotes, which have small genomes and lack splicing (Baudet et al. 2010). Since 98 % of the human genome is predicted to be non-coding (Lander et al. 2001), applying this approach to human would massively increase the search space, making it unattractive in terms of both computation time and error rate, while still omitting mutations, small open reading frames and non-AUG start sites.
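As an illustration of six-frame translation, the sketch below translates a DNA string in the three forward and three reverse-complement frames using the standard genetic code. The function names and the frame labels (`+1` to `-3`) are chosen for this example only; real pipelines additionally split the translations at stop codons and apply length filters.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Only one of the six translations of a given locus can correspond to the true reading frame, which is why this search space is dominated by sequences that are never expressed.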

In the six-frame translation approach, only one sixth of the translated sequences are true candidates, which impairs the statistical validation model used (Hernandez et al. 2014; Blakeley et al. 2012). Furthermore, splice isoforms, single nucleotide variations and indels remain undetectable in a six-frame translated reference database. A smaller reference database can be constructed from cDNA libraries or expressed sequence tags (ESTs), which ensures that the corresponding sequences are transcribed, as they are derived from RNA (Hernandez et al. 2014). Because such a reference database is constructed from RNA, alternative splice proteoforms may also be included. Applying this strategy in human compressed the database to 3 % of the size of a six-frame reference database, with minimal sacrifices to the peptide sequence content (Edwards 2007). Another study using the Ensembl (Cunningham et al. 2014) database, including all isoforms, observed a 7 % increase in peptide identifications compared to the non-redundant Swiss-Prot database (Fei et al. 2011). Tools such as GENQUEST reduce the search space by filtering peptides on their mass and isoelectric point (Sevinsky et al. 2008). Although the aforementioned database choices have proven useful, the resulting reference databases contain sequences at a species-wide level, so sample-specific genomic variations (SNVs, indels) and RNA splice variations remain unregistered. Next-generation sequencing (NGS) techniques capture the transcriptome and/or translatome relatively accurately, quickly and cost-efficiently, thus enabling sample-specific reference database construction (Bahassi and Stambrook 2014). This review discusses how the integration of NGS techniques with MS-based proteomics enables the identification of novel, small proteins, focusing strongly on ribosome profiling and RNA-Seq. To illustrate the relevance of these techniques in current research fields, RNA-Seq mediated neo-antigen discovery and RIBO-Seq empowered micropeptide identification are discussed.

4.2 RNA-Seq

The majority of MS-based proteomic studies compare the obtained spectra against protein databases of known/predicted proteins, which leaves a high number of spectra unidentified. These unidentified spectra may map to novel peptides absent from the protein database used, or represent splice variants, alternative open reading frames (e.g., stop codon read-through, alternative start sites) or genetic variations (Ning and Nesvizhskii 2010). RNA-Seq provides a comprehensive profile of the transcriptome and enables the construction of a database reflecting the native transcript composition, including those novel sequences (Woo et al. 2014; Marguerat and Bähler 2010; Wang et al. 2009b). Wang et al. (2012) describe a workflow to derive a protein database from RNA-Seq data and report a substantial increase in peptide identifications in comparison to searches against an Ensembl database. Furthermore, RNA-Seq data allowed the detection of peptides containing SNPs associated with cancer. A workflow designed by Sheynkman et al. (2013), which builds a database focused on splice junctions derived from RNA-Seq, identified unannotated transcript junctions in Jurkat cells. Compared to cDNA and EST libraries, RNA-Seq provides a more advanced and comprehensive methodology to identify novel splice junctions (Sheynkman et al. 2013). Moreover, RNA-Seq enables proteomics studies on non-model organisms with limited genome annotation (Lopez-Casado et al. 2012; Song et al. 2012; Armengaud 2013). Many RNA-Seq datasets are publicly available (e.g., in the Sequence Read Archive (Leinonen et al. 2011b) or European Nucleotide Archive (Leinonen et al. 2011a)) and can be utilized in proteogenomics applications. When non-matching proteomics and transcriptomics datasets are used, it is advised to pool multiple RNA-Seq experiments to construct the search space (Woo et al. 2014).

4.2.1 Neo-antigens

The immune system recognizes an extensive range of antigens, which are distinguished as either ‘self’ or ‘non-self’ molecules. Nearly all human cells present peptide antigens on major histocompatibility complex (MHC) molecules, which interact with T-cell receptors (TCR) present on the plasma membrane of T-cells. When a peptide presented on the MHC is not recognized as ‘self’, this elicits a T-cell response, causing apoptosis or inactivation of the corresponding target cell. The presentation of ‘non-self’ peptide antigens may have various causes, ranging from viral infection to disturbed homeostasis (Singhal et al. 2013; Attaf et al. 2015). As tumor cells evolve from ordinary cells, they develop distinct characteristics recognizable by the immune system. Hence, the immune system is of great importance in cancer development. It can promote tumor growth by impairing tumor cell immunogenicity, or act as a tumor suppressor by destroying tumor cells or restraining tumor expansion (Koebel et al. 2007; Shankaran et al. 2001; Dunn et al. 2002). Immunotherapy, where T-cell activity is stimulated through the inhibition of the T-cell deactivation pathway (checkpoint blockade (Gubin et al. 2014)), has been shown to be an effective treatment in a variety of human malignancies (Wolchok and Chan 2014; Sharma and Allison 2015). For instance, Hinrichs and Rosenberg (2014) demonstrated how infusion of tumor-infiltrating lymphocytes can be an effective treatment option in metastatic melanoma, and antibody treatment promoting T-cell activation improved overall survival of metastatic melanoma patients (Hodi et al. 2010). The ability of T-cells to mount a response based on the interaction with MHC molecules on tumor cells indicates the existence of tumor-specific epitopes on antigens. These antigens can be derived from native proteins for which T-cell tolerance is incomplete (e.g., proteins whose expression is restricted to certain tissues or time windows), or they can be formed from proteins not encoded in the normal human genome (e.g., mutated proteins), in which case they are called neo-antigens. Neo-epitopes are a product of tumor-specific DNA alterations that result in novel protein sequences (Schumacher and Schreiber 2015).

Studies in mouse models indicate that vaccination with neo-antigens increases tumor control in immunotherapy (Gubin et al. 2014; Yadav et al. 2014). However, neo-antigen identification is tedious, and limitations in MS sensitivity result in a substantial fraction of false negatives. Also, the identification of genomic variations in proteins does not guarantee MHC presentation. Combining transcriptome sequencing (RNA-Seq), to identify mutated proteins absent in native cells, with proteomics identification of MHC-presented antigens provides a feasible workflow usable in clinical studies. The global design of this workflow consists of the identification of tumor-specific genomic variation through RNA-Seq, followed by optional in silico filtering with algorithms that predict MHC antigen presentation, and the construction of a database of possible neo-antigens (Lu et al. 2014; Linnemann et al. 2014; Robbins et al. 2013). Next, MS-based proteomics matches the experimentally identified MHC-bound antigens against the RNA-Seq derived database, selecting high-confidence neo-antigens. Functional assays can then be performed to experimentally validate neo-antigens, as demonstrated in mouse models in which cancer was successfully treated (Rizvi et al. 2015; Yadav et al. 2014; Bassani-Sternberg et al. 2015). Figure 4.1 provides a summary of the neo-antigen identification workflow.

Fig. 4.1

A simplified neo-antigen identification workflow. Tumor cells are sequenced to identify genomic variations specific to these tumor cells; next, a database of neo-antigen candidates is generated. Optionally, in silico algorithms can be used to predict MHC antigen presentation, resulting in a more confident dataset. Next, MS-based proteomics identifies MHC-bound antigens, followed by functional analysis confirming candidate neo-antigens
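The database-construction step of this workflow can be illustrated with a small sketch that enumerates candidate peptides spanning a single amino acid substitution. It is a simplified assumption-laden example: the peptide lengths of 8 to 11 residues (typical for MHC class I ligands), the 0-based position convention, the function name and the toy sequence are all choices made here, and no MHC binding prediction is performed.

```python
def mutant_peptides(wild_type, position, new_residue, lengths=(8, 9, 10, 11)):
    """Enumerate peptides that cover a substituted residue.

    wild_type   -- wild-type protein sequence
    position    -- 0-based index of the substituted residue
    new_residue -- amino acid observed in the tumor sample
    lengths     -- peptide lengths to generate (assumed MHC class I range)
    """
    if wild_type[position] == new_residue:
        raise ValueError("substitution does not change the sequence")
    mutant = wild_type[:position] + new_residue + wild_type[position + 1:]
    candidates = set()
    for k in lengths:
        first = max(0, position - k + 1)         # leftmost window covering the site
        last = min(position, len(mutant) - k)    # rightmost window covering the site
        for start in range(first, last + 1):
            peptide = mutant[start:start + k]
            # Keep only sequences that do not also occur in the wild-type protein.
            if peptide not in wild_type:
                candidates.add(peptide)
    return candidates


if __name__ == "__main__":
    toy_protein = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical sequence
    for peptide in sorted(mutant_peptides(toy_protein, 10, "A", lengths=(9,))):
        print(peptide)
```

The resulting peptides would form the sample-specific search space against which spectra of MHC-bound peptides are matched.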

4.3 RIBO-Seq

In the late 1960s, the ability of ribosomes to protect mRNA from endonuclease digestion was demonstrated (Steitz 1969). Despite this early discovery, it was not until the advent of NGS and the accompanying bioinformatics toolsets that genome-wide translatome profiling became attainable. At the end of the twentieth century a technique named polysome profiling emerged (Johannes et al. 1999), enabling large-scale analysis of translation. In summary, polysome profiling captures mRNA immobilized on translating ribosomes, separates these polyribosomes (e.g., by ultracentrifugation on a sucrose gradient) and subsequently sequences the obtained RNA fragments (Faye et al. 2014). This technique, which identifies mRNA with ribosomal occupancy, has seen various use cases throughout the years and is still frequently applied (Piccirillo et al. 2014). However, it was the advent of RIBO-Seq, enabling massively parallel sequencing of the approximately 30 nt mRNA fragments protected by ribosomes (ribosome protected fragments, RPFs), that empowered in-depth assessment of the translatome (Ingolia et al. 2009, 2012, 2014). The main advantage of RIBO-Seq over polysome profiling is that the RPFs provide positional information with sub-codon resolution, enabling accurate prediction of the ribosome A-site positions. The RIBO-Seq technique diverged into two complementary implementations, capturing either elongating ribosomes or initiating ribosomes. RIBO-Seq of elongating ribosomes is feasible through the addition of antibiotics inhibiting ribosome translocation (e.g., cycloheximide (Ingolia et al. 2009) and emetine (Ingolia et al. 2012)) or peptidyl transferase activity (e.g., chloramphenicol), or by thermal freezing (Oh et al. 2011). Capturing initiating ribosomes, which allows the deduction of translation initiation sites (TIS), is achieved through the addition of initiation-blocking antibiotics (e.g., harringtonine (Ingolia et al. 2012) or lactimidomycin (Lee et al. 2012)). Figure 4.2 sketches an overview of the RIBO-Seq protocol.

Fig. 4.2

A general overview of the RIBO-Seq protocol. First, cell lysates are prepared in conditions accurately reflecting in vivo translation. Second, added nucleases digest the RNA (nuclease footprinting); the approximately 30 nt mRNA fragments enclosed by ribosomes are, however, protected from digestion (ribosome footprints). Next, ribosome footprints are separated from the cell lysate, followed by purification of the ribosome protected RNA. Ligation of single-stranded adaptors enables reverse transcription. Subsequently, first-strand reverse transcription products are circularized and products hybridizing to rRNA probes are depleted. Finally, the remaining sequences are amplified by PCR and sequenced. An in-depth description of the protocol is provided by Ingolia et al. (2012b)

4.3.1 RIBO-Seq Unravels the Translatome

Although many variations in gene expression are attributable to changes at the transcript level, RIBO-Seq also reveals pervasive translational regulation (Michel and Baranov 2013). For example, Ingolia et al. (2009b) examined the ability of ribosome profiling to monitor changes in protein synthesis in response to starvation in yeast, observing translational changes in approximately one-third of the genes. Two other studies examining the translatome in response to heat shock (Shalgi et al. 2013) and proteotoxic stress (Liu et al. 2013) revealed how chaperones influence elongating ribosomes in response to these stresses. Brar et al. (2012) explored changes in expression during meiosis in yeast by performing RIBO-Seq over stage-specific time points and captured numerous dynamic events (including translation products of small open reading frames) that had not been identified by other techniques. Stern-Ginossar et al. (2012) analyzed gene expression changes in human foreskin fibroblasts during cytomegalovirus infection. Measurements across different time points revealed prominent translational regulation of viral genes, with translation varying at least fivefold in 82 % of ORFs.

Furthermore, RIBO-Seq can identify novel translated regions that were until now undetectable with other techniques. For instance, several 5’-UTR ORFs associated with a regulatory function (Ingolia et al. 2009, 2011; Brar et al. 2012) have been identified by ribosome profiling. ORFs in 5’ untranslated regions are difficult to identify due to their specific characteristics: short length, limited coverage, non-AUG initiation and, in some cases, overlap with canonical ORFs. Michel et al. (2012) demonstrated that, given sufficient ribosome coverage, alternative reading frames are discernible by analyzing the triplet codon periodicity that is characteristic of translation and observable with the ribosome profiling technique. They reported on 5’-UTR ORFs with higher RPF intensity than the main canonical downstream ORF. In many cases these upstream ORFs (uORFs) partly overlapped with the canonical ORF. Furthermore, Michel et al. identified frame transitions in translation, confirming well-known cases of frame shifts in humans. In a study performed by Gerashchenko et al. (2012) in yeast, four novel frame shift events were identified that correlated with oxidative stress. Also, start site determination with the ribosome profiling technique enables the identification of ORFs with non-AUG start sites, which has resulted in numerous identified near-cognate initiation sites. Wan and Qian (2014) developed a database containing alternative translation initiation sites and their associated ORFs identified by RIBO-Seq. Ribosomal activity was also observed in non-coding regions, revealing putative novel protein-coding regions (Ingolia et al. 2012; Lee et al. 2012).
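The triplet periodicity mentioned above can be made concrete with a small sketch that assigns RPF positions to the three frames of a candidate ORF. It assumes that each read has already been reduced to a single transcript coordinate (for example an inferred P-site position); how that offset is determined is outside the scope of this example, and the function name and toy coordinates are illustrative only.

```python
from collections import Counter


def frame_fractions(read_positions, orf_start, orf_end):
    """Fraction of reads falling in frame 0, 1 and 2 relative to `orf_start`.

    read_positions -- iterable of 0-based transcript coordinates, one per RPF
    orf_start      -- coordinate of the first nucleotide of the start codon
    orf_end        -- end coordinate of the ORF (exclusive)
    """
    counts = Counter(
        (pos - orf_start) % 3
        for pos in read_positions
        if orf_start <= pos < orf_end
    )
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(counts[frame] / total for frame in range(3))


if __name__ == "__main__":
    toy_positions = [100, 103, 106, 109, 101, 110]  # hypothetical read positions
    print(frame_fractions(toy_positions, orf_start=100, orf_end=112))
```

A strong enrichment in one frame is consistent with active translation in that frame, whereas a flat distribution over the three frames gives no such support; overlapping ORFs in different frames appear as mixed signals.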

4.3.2 RIBO-Seq, a Bridge Between RNA-Seq and Proteomics

Protein inference from transcript abundance assumes constant RNA stability as well as stable translation rates. This assumption is erroneous, as RNA stability can be highly variable and translation rates differ strongly across transcripts. RIBO-Seq bridges the gap between RNA-Seq and proteomics by providing translational information, enabling improved inference from the transcriptome to the proteome and vice versa. RIBO-Seq is capable of detecting translated transcripts, but provides no direct evidence that these translated sequences ultimately yield stable protein products. Ribosomal occupancy could serve regulatory functions, but could also point to unstable protein products or noise (Ingolia et al. 2014; Guttman and Rinn 2012). Several in silico tools and metrics were devised to predict the coding potential of ORFs (based on ribosome protected fragment length (Ingolia et al. 2014), triplet periodicity (Bazzini et al. 2014) and conservation (Lin et al. 2011)). However, MS-based validation remains a crucial confirmation technique in most cases. In turn, MS-based proteomics requires a database consisting of sample-specific protein sequences. RIBO-Seq assisted database generation has several advantages over RNA-Seq generated databases. Novel proteoforms can be identified while the search space is optimized (Calviello et al. 2015; Menschaert et al. 2013; Van Damme et al. 2014; Koch et al. 2014). This approach was used by Fritsch et al. (2012) to identify 546 N-terminal protein extensions in human, and Menschaert et al. (2013) observed a 2.5 % increase in the overall protein identification rate using this approach. In a recent study performed by Fields et al. (2015), 1990 protein isoforms, 696 truncations, 341 extensions and 1379 upstream ORFs were identified by RIBO-Seq. Automated pipelines facilitating RIBO-Seq integration in MS-based experiments, such as PROTEOFORMER (Crappé et al. 2014a), are readily available and easy to implement. Moreover, Xie et al. (2015) developed an online database to query, analyze, visualize and download RIBO-Seq datasets.

4.4 Micropeptides

Micropeptides are defined as functional translation products originating from small open reading frames (sORFs). No consensus has been reached regarding the sORF size, and some studies consider an upper threshold of 200–250 codons (Hayden and Bosco 2008; Yang et al. 2011). However, the most widespread sORF size limit is 100 codons, a rule that we endorse here. A pioneering genome-wide study on yeast in 2003 suggested the functional importance of sORFs (Kessler et al. 2003), describing functionally conserved sORFs discovered by means of cross-species BLAST analysis. Only a few years later, Savard et al. (2006) identified mille-pattes in the red flour beetle by means of EST screening, a polycistronic transcript encoding four sORFs that regulates HOX genes. Galindo et al. (2007) and Kondo et al. (2007) examined mille-pattes analogs in Drosophila melanogaster, resulting in the discovery of the tarsal-less (tal) and polished rice (pri) genes, respectively. This polycistronic mRNA, previously categorized as non-coding, had apparently been mis-annotated based on ORF size (Tupy et al. 2005). At the moment of writing, the tal and pri translation products are among the best characterized examples of micropeptides, regulating embryonic development in numerous insect species (Chanut-Delalande et al. 2014). The discovery of the tal and pri genes, together with the advent of ribosome profiling, boosted research into sORF-encoded micropeptides. Several research groups have reported the discovery of putatively coding sORFs using various techniques, pointing to novel functional micropeptides (Saghatelian and Couso 2015; Chu et al. 2015; Bazzini et al. 2014; Magny et al. 2013; Slavoff et al. 2013; Tonkin and Rosenthal 2015; Crappé et al. 2013; Pauli et al. 2014). Toddler, for example, is an embryonic signal that promotes cell movement (Pauli et al. 2014), Myoregulin regulates Ca2+ handling in muscle cells (Magny et al. 2013) and Sarcolipin regulates muscle-based thermogenesis in mammals (Tonkin and Rosenthal 2015). This is a relatively new research field (Crappé et al. 2014b; Andrews and Rothnagel 2014; Albuquerque et al. 2015), in which the results of many in silico studies and proteogenomics endeavors need further experimental validation.

4.4.1 In Silico Micropeptide Identification

Automated gene annotation systems correctly identify the majority of verified protein-coding ORFs based on recognizable genomic sequence characteristics (e.g., canonical initiation codons, splice sites, promoter sequences) (Sleator 2010). Most gene annotation algorithms set a lower threshold of 100 base triplets to exclude false positive annotations (Carninci et al. 2005; Frith et al. 2006a, b; Dinger et al. 2008). Recent studies suggest that applying this lower threshold precludes the identification of numerous small proteins (Pauli et al. 2014; Bazzini et al. 2014; Ma et al. 2014; Frith et al. 2006a, b; Chng et al. 2013; Galindo et al. 2007; Crappé et al. 2013). Some computational approaches have been developed, such as uPEPperoni (Skarshewski et al. 2014) and sORF finder (Hanada et al. 2009), providing in silico assessment of putatively coding sORFs based on phylogenetic conservation. The identification of sORFs is relatively straightforward, as it only requires a start and a stop codon separated by at most 98 codons. Discriminating coding from non-coding sORFs within this excessive pool has, however, proved to be more difficult. Due to their small size, many sORFs lacking any coding potential occur by chance. Cross-species conservation can be used as a proxy for function, but relying solely on phylogenetic conservation could prevent the identification of biologically relevant species-specific sORFs (Clamp et al. 2007). PhyloCSF (Lin et al. 2011) models phylogenetic relations between species by analyzing conservation at the amino acid level rather than the nucleotide level, and is the method most regularly used for small open reading frame assessment. It outperforms other methodologies (Reading Frame Conservation metrics, the regular CSF method or a dN/dS test) and is capable of identifying micropeptide-coding sORFs as short as 13 amino acids (Guttman and Rinn 2012). Using mainly conservation as a criterion, Mackowiak et al. (2015) identified numerous conserved sORFs in different species (831 in H. sapiens, 350 in M. musculus, 211 in D. rerio, 194 in D. melanogaster, and 416 in C. elegans), some of which have been described and characterized previously.
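To show why the enumeration of sORFs is the easy part, the sketch below scans a transcript sequence for AUG-initiated ORFs of at most 100 codons, counting the start and stop codon (i.e., a start and a stop separated by at most 98 codons). It is a naive illustration: it considers the forward strand only, ignores non-AUG starts, and its function name and default thresholds are assumptions for this example rather than the behavior of any published tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(transcript, max_codons=100, min_codons=2):
    """Return (start, end, n_codons) for every ATG...stop ORF within the limits.

    Coordinates are 0-based, `end` is exclusive and includes the stop codon;
    `n_codons` counts the start and the stop codon.
    """
    seq = transcript.upper().replace("U", "T")
    sorfs = []
    start = seq.find("ATG")
    while start != -1:
        # Walk codon by codon from this start until the first in-frame stop.
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                n_codons = (i + 3 - start) // 3
                if min_codons <= n_codons <= max_codons:
                    sorfs.append((start, i + 3, n_codons))
                break
        start = seq.find("ATG", start + 1)
    return sorfs


if __name__ == "__main__":
    toy_transcript = "CC" + "ATG" + "GCC" * 3 + "TAA" + "GG"  # hypothetical sequence
    print(find_sorfs(toy_transcript))
```

Every in-frame ATG upstream of the same stop codon is reported separately, so even this simple scan produces a large pool of candidates, most of which are expected to occur by chance.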

4.4.2 RIBO-Seq Enables the Identification of Translated sORFs

RNA-based transcriptomics does not delineate ORFs; therefore, most studies rely on conservation and pattern recognition for sORF identification. A recent study in yeast identified several micropeptides, one of which was also functionally characterized as playing a role in osmotic stress. The technique used a six-frame translation database derived from RNA-Seq data as a search space for subsequent matching of MS fragmentation spectra (Yagoub et al. 2015). However, unlike RIBO-Seq, RNA-Seq does not indicate whether the sORFs are translated. On top of pinpointing translated mRNA regions, RIBO-Seq can also reveal TIS, enabling the detection of non-AUG sORFs. In silico detection of non-AUG sORFs is laborious and difficult, since the search space becomes considerably larger, but previous RIBO-Seq studies have made clear that non-canonical start codons are more common than previously expected (Ingolia et al. 2011). Also, Slavoff et al. (2013) identified translation products from sORFs with non-AUG start sites using an MS-based proteogenomics approach. Recently, Fields et al. (2015) used a regression method on ribosome profiling data to identify sORFs whose RPF pattern resembles that of annotated protein-coding ORFs. They discovered numerous sORFs, a subset of which shows very weak sequence conservation.

sORFs can be located in coding sequences (CDS), in 5’-untranslated regions (5’-UTR), in 3’-untranslated regions (3’-UTR), in intergenic regions (in-between genes) or in non-coding RNA regions. A first indication that 5’-UTR sORFs are translated was reported by Crowe et al. (2006). They revealed that 20 % of human 5’-UTR ORFs have a TIS in an optimal Kozak sequence context, which permits ribosomal recognition. Follow-up studies revealed approximately 6750 conserved upstream TIS in mice (Lee et al. 2012) and approximately 3000 novel 5’-UTR sORFs in human (Fritsch et al. 2012). A few 5’-UTR sORFs were identified that encode micropeptides with regulatory functions (e.g., MKKS in human (Akimoto et al. 2013), CPA1 in yeast (Werner et al. 1987)). Jorgensen and Dorantes-Acosta (2012) claimed that 5’-UTR sORFs can regulate the downstream translation of the canonical ORF (also called the peptoswitch mechanism), as exemplified by CPA1. The discovery of dually coding transcripts (transcripts where more than one overlapping ORF can be translated) enabled the discovery of CDS-overlapping sORFs (e.g., CASP1 (Ronsin et al. 1999) and altPrP (Vanderperre et al. 2011) in human). Most 3’-UTR sORFs are considered non-coding, which is supported by their RIBO-Seq profiles, which closely resemble those of non-coding ORFs. Still, a limited set of 3’-UTR sORFs was identified by MS-based techniques (e.g., Bazzini et al. (2014) identified ten 3’-UTR sORFs using MS in combination with RIBO-Seq in a proteogenomics approach). sORFs in intergenic as well as in non-coding regions have been observed with RIBO-Seq (Lee et al. 2012). In particular, ribosomal activity on long non-coding RNA (lncRNA) fuelled a debate in the scientific community (Pauli et al. 2015) on whether or not lncRNAs are truly non-coding (Ruiz-Orera et al. 2014; Smith et al. 2014). Figure 4.3 provides an overview of sORFs identified in different (annotated) genomic regions.

Fig. 4.3

sORF classification. sORFs can be classified according to their genomic location; an overview of the different sORF classes is provided here
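The location-based classification can be expressed as a simple decision rule. The sketch below works in transcript coordinates and is only a schematic of the categories listed in the text: the function name, the half-open coordinate convention, and the way intergenic and non-coding RNA sORFs are flagged are assumptions made for this example.

```python
def classify_sorf(sorf_start, sorf_end, cds=None, on_annotated_transcript=True):
    """Assign an sORF to one of the location classes described in the text.

    sorf_start, sorf_end    -- half-open coordinates of the sORF
    cds                     -- (cds_start, cds_end) of the annotated CDS on the same
                               transcript, or None for a non-coding RNA
    on_annotated_transcript -- False if the sORF lies between annotated genes
    """
    if not on_annotated_transcript:
        return "intergenic"
    if cds is None:
        return "non-coding RNA"
    cds_start, cds_end = cds
    if sorf_end <= cds_start:
        return "5'-UTR"
    if sorf_start >= cds_end:
        return "3'-UTR"
    # Includes upstream ORFs that partly overlap the canonical ORF.
    return "CDS-overlapping"


if __name__ == "__main__":
    print(classify_sorf(10, 70, cds=(120, 900)))
    print(classify_sorf(100, 160, cds=(120, 900)))
    print(classify_sorf(30, 90, cds=None))
```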

4.4.3 Multi-omics Integration Is Still Indispensable

Ribosome occupancy does not necessarily imply translation into functional protein products; furthermore, RIBO-Seq is susceptible to noise. Besides conservation, several tools and metrics were developed to distinguish coding from non-coding sORFs. For example, Ingolia et al. (2014) observed that the ribosome protected fragment (RPF) length distribution differs significantly between truly coding and non-coding ORFs and developed the FLOSS score to distinguish between both categories (Fig. 4.4). Bazzini et al. (2014) developed the ORFscore, which quantifies the preference of RPFs to accumulate in the first frame of coding sequences (Fig. 4.4), making full use of the triplet periodicity in the RIBO-Seq signal. The Ribosome Release Score (RRS) examines the release of translating ribosomes after hitting a stop codon (Guttman and Rinn 2012) (Fig. 4.4). More complex statistical methods are based on learning algorithms, such as Coding Potential Calculator (Kong et al. 2007), CRITICA (Badger and Olsen 1999), CSTMiner (Castrignanò et al. 2004) and the recently described ORF-RATER (Fields et al. 2015) and RiboTaper (Calviello et al. 2015). ORF-RATER, a regression-based translated ORF identifier using RIBO-Seq data, discovered numerous novel ORFs, including sORFs with MS evidence (Fields et al. 2015). Likewise, RiboTaper exploits a statistical approach to identify translated ORFs based on the nucleotide periodicity of RIBO-Seq data and correctly identified annotated protein-coding sORFs, such as the aforementioned Toddler sORF (Calviello et al. 2015). However, in the novel field of micropeptide discovery, MS-based identification still remains indispensable. A proteogenomics approach that generates a database of putatively coding sORFs derived from RIBO-Seq (or RNA-Seq) information, followed by MS-based proteomics identification, creates an ideal setting for sORF discovery. Numerous sORFs have been identified using this approach (Ma et al. 2014; Bazzini et al. 2014; Mackowiak et al. 2015). A public database for sORFs (http://www.sorfs.org) exists, gathering multi-omics (RIBO-Seq and MS) evidence and in silico metrics. The resource currently harbors 266,342 sORFs across three model species (human, mouse, fruit fly) (Olexiouk et al. 2015), but will expand in the near future with more data on other organisms and cell types, and will include the latest “coding potential” metrics. Figure 4.5 provides an overview of the micropeptide identification workflow.

Fig. 4.4

Overview of coding potential assessment methods based on RIBO-Seq. The FLOSS score compares the RPF-length distribution of sORFs with the RPF-length distribution of canonical protein-coding transcripts; strong disagreement between the two RPF-length distributions indicates non-coding behavior. The ORFscore quantifies the preference of RPFs of coding ORFs to accumulate in the first frame of the coding sequence, and the RRS provides a score based on the tendency of ribosomes to dissociate from the RNA after hitting a stop codon in coding ORFs
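Two of these metrics can be sketched in a few lines. The formulas below follow the definitions as they are commonly stated for the FLOSS score (half the summed absolute difference between two normalized RPF-length distributions) and the ORFscore (a log-transformed chi-square-like statistic over the three frame counts, made negative when the first frame is not dominant). They should be read as assumptions to be checked against Ingolia et al. (2014) and Bazzini et al. (2014), not as reference implementations; the function names and toy counts are illustrative.

```python
import math


def floss(length_counts, reference_counts):
    """FLOSS-like score: 0 for identical length distributions, 1 for disjoint ones.

    Both arguments map RPF length (nt) to read count.
    """
    n = sum(length_counts.values())
    n_ref = sum(reference_counts.values())
    if n == 0 or n_ref == 0:
        raise ValueError("both distributions need at least one read")
    lengths = set(length_counts) | set(reference_counts)
    return 0.5 * sum(
        abs(length_counts.get(l, 0) / n - reference_counts.get(l, 0) / n_ref)
        for l in lengths
    )


def orf_score(frame_counts):
    """ORFscore-like statistic from RPF counts in frames 1, 2 and 3 of an ORF."""
    f1, f2, f3 = frame_counts
    mean = (f1 + f2 + f3) / 3
    if mean == 0:
        return 0.0
    deviation = sum((f - mean) ** 2 / mean for f in (f1, f2, f3))
    score = math.log2(deviation + 1)
    # Negative sign when the annotated frame does not carry the most reads.
    return -score if (f1 < f2 or f1 < f3) else score


if __name__ == "__main__":
    reference = {28: 40, 29: 50, 30: 10}   # hypothetical canonical CDS profile
    candidate = {28: 35, 29: 55, 30: 10}   # hypothetical sORF profile
    print(floss(candidate, reference))
    print(orf_score((30, 4, 2)))
```

Low FLOSS values and high positive ORFscore values point towards coding behavior; both metrics become unreliable when an sORF is covered by only a handful of reads.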

Fig. 4.5

A simplified micropeptide identification workflow. First, translated sORFs are identified using RIBO-Seq. Next, candidate protein-coding sORFs are predicted using the methods described in the “Multi-omics integration is still indispensable” section, and a database of translated sORFs is generated for proteomics identification. Results from both pathways can be combined in order to select micropeptides for functional analysis

4.5 Conclusion and Future Perspectives

A multi-omics identification workflow for translation products is advantageous in general and indispensable for the identification of novel (small) proteoforms. Such a proteogenomics approach is in many cases sample specific, enabling the analysis of sample-specific variations. In cancer research, where variations acquired in a single cell may result in tumorous behavior and where these variations are frequently distinct between tumor types, capturing such sample-specific variations is crucial. Identification of neo-antigens in essence amounts to the identification of sample-specific variation, which is obtainable by transcriptome sequencing technologies. However, MS-based proteomics identification remains essential to determine whether these transcript changes yield non-synonymous peptide variations. While still in its infancy, neo-antigen research increases the overall understanding of the immune system and moreover holds important therapeutic value.

The RIBO-Seq enabled genome-wide assessment of translation (translatomics) bridges two omics fields: transcriptomics and proteomics. Genome-wide analysis of this ribosome profiling information has already resulted in the identification of numerous sORFs with coding potential, questioning the non-coding character of sORFs. Follow-up analyses observed sORFs that resemble canonical coding ORFs, and some have in the meantime been fully characterized as coding. Over the last years, various tools and metrics were devised to assess the coding potential of sORFs (both conservation and sequence based). Also, workflows aiding the integration of RIBO-Seq information and MS-based proteomics are becoming available, e.g., PROTEOFORMER (Crappé et al. 2014a). The scientific community is becoming aware of sORFs as potentially protein-coding units. As a result, public sORF databases, such as http://www.sorfs.org, will be highly useful in the design of future experiments (Olexiouk et al. 2015). Moreover, already conducted experiments (with an emphasis on MS-based proteomics studies) should be reprocessed to account for micropeptides. There is also growing awareness of the large amount of publicly available proteomics data accumulated over the past years that is currently being left untouched, while scientific knowledge and technology have evolved tremendously (Vaudel et al. 2015a, b; Verheggen et al. 2015). The sORFs.org database already holds a pilot study in which 1172 publicly available MS datasets from PRIDE were reprocessed, providing MS evidence for more than 5000 micropeptides. Cumulative evidence that sORFs are able to encode functional micropeptides has been gathered, but their exact biological relevance often remains to be determined. Future research based on overexpression or knock-down will reveal more about the functional roles of specific sORF-encoded micropeptides.

4.6 Funding

Postdoctoral Fellows of the Research Foundation – Flanders (FWO-Vlaanderen) [G.M.,12A7813N]. Research Foundation – Flanders (FWO-Vlaanderen) [V.O, G0D3114N].