1 Proteomics in Oil Palm Research

The world population has risen from 6.6 to 7.6 billion over the past decade, escalating the demand for food. The palm oil market has surged accordingly, since palm oil provides essential ingredients for food and non-food applications. The most direct way to meet this demand is to increase the yield of palm oil per unit area of land, from the current average production of 3 tons of oil per hectare per year to a projected production of 18.5 tons of oil per hectare per year [1]. To achieve this, the palm oil industry needs to adopt policies that allow oil palm to be grown sustainably and that avoid negative environmental impacts. Global production of palm oil must therefore be increased by improving the yield of existing plantations [2]. To meet these challenges, ‘omics’ technologies have been applied in oil palm research to understand the yield-limiting and yield-reducing factors, with the aim of enhancing the yield and quality of palm oil [1]. In this respect, tremendous advances have been made over the past years, particularly in oil palm genomics [3,4,5,6,7,8,9,10]. However, the transcriptomic information generated is not sufficient or biologically comprehensive enough to reveal the actual state of the plant system biology at any specific stage and condition. Proteomics is therefore becoming an increasingly important tool in oil palm research, helping to explain biological processes of interest such as yield and oil quality, growth and development, and natural responses towards environmental stresses such as diseases (Fig. 1).

Fig. 1 Current and potential applications using the proteomics technologies

The first step in integrating the proteomics approach into oil palm research is to generate reference proteome maps for different oil palm species and fruit developmental stages. For that purpose, Lau, Hassan, Daim and co-workers developed protein extraction protocols tailored for the fruit mesocarp, leaf and root tissues [11,12,13]. Together with the transcript sequences for Elaeis guineensis and Elaeis oleifera [7], proteomics technologies are used to improve the current understanding of the palm oil biosynthesis machinery in the search for potential protein biomarkers for yield and oil quality. Difference gel electrophoresis (DIGE) analysis was used in several studies to determine the control mechanisms and key proteins that distinguish different agronomic traits related to lipid biosynthesis. For instance, Ooi and co-workers detected 41 unique and differentially expressed proteins in the fruit mesocarp of high- and low-yielding oil palms that were not related to lipid biosynthesis [14]. The findings showed that the regulation of lipid biosynthesis involves a myriad of metabolic pathways, and they were strongly supported by Lau et al. [15, 16] using shotgun quantitative proteomics technologies. Lau et al. also discovered several differentially accumulated proteins from other metabolic pathways, such as glycolysis, that may have contributed to the regulation of fatty acid biosynthesis. In a post-translational modification (PTM) study by Lau and co-workers, selected reaction monitoring (SRM) targeting peptide fragments that corresponded to fatty acid biosynthetic proteins revealed that phosphorylation could explain the differences in the oleic acid content of the studied oil palm species [17].

Protein changes during somatic embryogenesis, which involves cell differentiation, have been studied using gel-based proteomics to understand the molecular events of plant embryo development in vitro [18]. The differentially expressed proteins were identified in the early stage of embryogenesis and were also stage-specific. Gel-based quantitative proteomics was applied in another study to understand the biological mechanism behind the low level of embryogenesis [19]. That study identified three proteins, namely triosephosphate isomerase, L-ascorbate peroxidase and superoxide dismutase, as potential biomarkers at both the protein and transcript levels.

Proteomics technologies have been actively used to investigate the interaction between oil palm and the pathogenic fungus Ganoderma boninense. Using a gel-based proteomics approach, several studies were carried out to determine differentially expressed proteins upon infection of the oil palm root system with G. boninense. An optimized protein extraction using phenol/ammonium acetate in methanol was first developed by Al-Obaidi and co-workers to analyze the protein profile of Ganoderma species [20]. Al-Obaidi and co-workers also identified 21 proteins from healthy and G. boninense-infected roots that showed differences in their expression profiles [21]. Protein profiling of infected roots at different time points indicated that 12 proteins were differentially expressed 7 days after infection with G. boninense [22]. Further comparison between pathogenic and non-pathogenic Ganoderma species revealed 24 differentially expressed proteins that corresponded to the Ganoderma species inoculations [23]. These proteins could explain the disease susceptibility of the oil palm. In the effort to develop an early detection method for G. boninense, proteins in the leaf tissues of oil palm were profiled using a combination of gel electrophoresis and shotgun proteomics [24]. The majority of the 51 proteins identified were involved in photosynthesis, carbohydrate metabolism, immunity and defense. Table 1 summarizes the proteins that have been reported as differentially expressed throughout the various stages of palm oil production and during Ganoderma infection.

Table 1 Differentially expressed proteins that have possible involvement in palm oil production and Ganoderma disease defence

2 Challenges in Plant-Based Proteomics

A complete proteomics pipeline is made up of several stages: protein extraction, digestion, separation and quantitation of peptides, and protein identification with mass spectrometry. The proteome of a cell or tissue at any given time is highly dynamic and complex. Combinations of different approaches have been employed to study subgroups of proteins and entire proteomes, largely because of the proteome complexity and wide dynamic range. Protein and peptide separation has essentially been carried out using two complementary approaches: gel-based and non-gel-based (also called gel-free or shotgun). These approaches vary in terms of peptide generation, separation and detection. Ultimately, each approach covers only specific protein subgroups and not the whole proteome. The gel-based approach is the cornerstone of proteomic analysis. Gel electrophoresis is a powerful technique for separating complex protein mixtures to yield qualitative and quantitative high-resolution snapshots of intact proteins (two-dimensional) and polypeptides (one-dimensional), giving a quick overview of protein isoform varieties and the detection of any post-translational modification.

The limitations of the gel-based approach, however, include its inability to resolve hydrophobic proteins or proteins with extreme sizes and isoelectric points. Thus, mass spectrometric analysis of these proteins is often not optimal unless further proteomics tools are employed. Gel-free or shotgun proteomics has been developed continuously, mainly because of the need to reduce technical variation in high-throughput workflows, which is not achievable using gel electrophoresis. In the gel-free or shotgun approach, liquid chromatography is coupled to an ionization source, which is typically nanoelectrospray for peptides. Peptides are separated using a reversed-phase column alone or a combination of different columns for high-resolution separation. Peptide ions are normally fragmented through collision with gas molecules, and the enormous amount of data from the acquired tandem mass spectra is then searched against protein databases using search engines such as Mascot to obtain the protein identities.
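The digestion step that precedes a shotgun database search can be illustrated with a minimal Python sketch. It assumes the commonly used trypsin rule (cleavage after lysine or arginine unless followed by proline, no missed cleavages) and standard monoisotopic residue masses; the function names and the example sequence are hypothetical and are not taken from Mascot or from any cited study.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the summed residues
PROTON = 1.00728   # mass of the charge carrier


def tryptic_digest(protein):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def precursor_mz(peptide, charge):
    """m/z of the protonated peptide at the given charge state."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


if __name__ == "__main__":
    # Hypothetical sequence, used only to exercise the functions.
    for pep in tryptic_digest("MKTAYIAKQRPLSK"):
        print(pep, round(precursor_mz(pep, 2), 4))
```

A search engine compares such theoretical precursor and fragment masses with the measured spectra; the sketch covers only the generation of candidate peptides.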

One of the core challenges in plant proteomics is the low protein concentration, especially when the amount of tissue is limited, as the cell wall and vacuole make up most of the cell mass. The cytosol, which is the center of the majority of cellular processes, makes up only 1–2% of the total cell volume [25]. Specialized procedures are essential to disrupt the plant cell wall and release the proteins. In addition, plant extracts contain numerous non-proteinaceous compounds such as polyphenols, pigments, polysaccharides, nucleic acids and lipids. Several major crops have been reported to contain high amounts of compounds that interfere with downstream proteomics analysis. For instance, oxidative enzymes (polyphenol oxidase), phenolic compounds, latex and carbohydrates are abundant in banana (Musa spp.) and stalk tissues [26,27,28]. These interfering compounds are co-purified with the precipitated proteins, rendering the proteins difficult to solubilize. Solubilization of proteins is crucial for resolving them in downstream analyses using techniques such as Western blot and mass spectrometry.

The existence of a high dynamic range of protein abundances in plant tissues further complicates protein analyses. For example, ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCo) makes up 40% of the total protein content of green tissues [29], while storage proteins are the most abundant proteins in seeds [25]. The presence of these highly abundant proteins complicates the detection of low-abundance proteins by protein electrophoresis and mass spectrometry. Normally, low-abundance proteins such as regulatory proteins are the proteins of interest [30]. Various fractionation techniques have been developed to deal with this wide dynamic range, which can span up to 12 orders of magnitude [31]. They are generally divided into electrophoresis- or chromatography-based fractionations that separate a subset of proteins [32]. For instance, isoelectric focusing, which exploits the charge differences of proteins, has been used to fractionate proteins and capture the less abundant ones [33, 34]. Other approaches are based on the principle of affinity chromatography, such as ATP and metal affinity [35], hydroxyapatite affinity [36] and immobilized metal affinity chromatography [37]. The latter technique is extremely useful for the enrichment of phosphorylated proteins in phosphoproteomics studies.

Palm oil, which is derived from the fruit mesocarp, can comprise up to 90% of the dry weight of the mesocarp [14], in addition to the plant-based interfering compounds mentioned earlier. Therefore, preparation of proteins from oil palm requires a labor-intensive workflow to be compatible with downstream proteomics analysis. Comprehensive protocols for preparing oil palm proteins from fruit mesocarps, young and mature leaves, and roots for gel-based and gel-free proteomics analysis have been developed in recent years. Lau and co-workers described an approach employing different solvent systems to remove the excess oil from oil palm fruit mesocarps prior to protein extraction with phenol [11]. Phenol extraction was also employed by Daim, Al-Obaidi, Silva and co-workers to investigate proteins from the oil palm leaf [13], Ganoderma-infected root [21], and callus, embryos and explants [18] using gel electrophoresis-based mass spectrometry analysis. In addition, Daim and co-workers first precipitated the extracted leaf proteins with trichloroacetic acid/acetone before the phenol extraction to improve the number and quality of gel spots [24]. The high quality of proteins extracted with phenol in these works may be attributed to the fact that proteins partition into the phenol phase, while contaminants such as polysaccharides, polyphenols and carbohydrates are removed in the aqueous phase. To reduce leaf protein complexity and improve quality for two-dimensional gel electrophoresis, Tan and co-workers applied both trichloroacetic acid/acetone precipitation and polyethylene glycol fractionation [38]. Trichloroacetic acid followed by acetone precipitation was also successfully applied in the extraction of proteins from oil palm root for gel-free mass spectrometric analysis [12].

3 Functional Proteomics Analysis

Early developments in quantitative proteomics were propelled by studies on yeast and mammalian cell lines [39]. Quantitative measurements are needed to elucidate changes in protein expression. Staining proteins on gels with stains such as Coomassie is routinely used to determine their intensities but, more often than not, this approach is tedious and error-prone. The intensity of a liquid chromatography peak detected with an ultraviolet–visible spectrophotometric detector is usually not proportional to the amount of protein in a given sample, because different types of protein absorb differently at the ultraviolet wavelengths used to record the chromatograms. For example, aromatic amino acids (chiefly tryptophan and tyrosine), together with a weak contribution from cystine disulfide bonds, absorb ultraviolet light at 280 nm, while the peptide backbone absorbs at 215–235 nm. Thus, stable isotope approaches were introduced into mass spectrometry-based proteomics to allow a more accurate and reliable determination of relative variations in peptide abundances. Several strategies are used today in quantitative proteomics, and each has its advantages and disadvantages. Quantitative proteome analyses commonly rely either on non-mass spectrometry-based quantitation techniques [25], such as DIGE, or on mass spectrometry-based quantitation techniques [40].

Determining protein ratios with gel-based methods can be error-prone because of gel-to-gel inconsistencies in the separated protein profiles. DIGE addresses these difficulties, which explains why it is the most commonly employed technique in non-mass spectrometry-based quantitative proteomics, including the quantitative analysis of plant proteomes. In DIGE, up to two different protein samples and a reference standard (containing equal amounts of both protein samples) are labelled with fluorescent dyes such as CyDyes (Cy3, Cy5, Cy2). The labelled samples and the standard are then pooled prior to separation with gel electrophoresis [41, 42]. Protein ratios between the two samples are calculated by measuring the fluorescence of each protein spot, thus revealing quantitative data for protein isoforms or differentially regulated proteins [43]. DIGE has been commonly used in plant proteomics studies, for example in investigations of elicitation effects in plant-symbiont and plant-pathogen interactions [44,45,46,47] as well as in studies on environmental stresses [48,49,50]. Gomez and co-workers [51] also demonstrated that DIGE coupled to MALDI-TOF analysis could be used to identify differentially expressed proteins in organisms lacking assembled genomes. Ooi and co-workers also applied the DIGE technique to identify 41 unique differentially accumulated proteins in oil palm fruit mesocarps at critical oil production stages [14]. Pro-Q Diamond, a fluorescent dye that binds to the phosphate moiety of phosphorylated proteins, has also been successfully employed to specifically label and quantify phosphorylated protein isoforms in plants [52,53,54,55,56,57].
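The role of the pooled reference standard in DIGE quantitation can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that sample A is labelled with Cy3, sample B with Cy5 and the standard with Cy2, and that spot volumes have already been extracted by image analysis software; the function name and input layout are hypothetical.

```python
import math
from statistics import mean


def dige_log2_fold_change(spot_volumes):
    """Log2 ratio of sample B over sample A for one spot across replicate gels.

    spot_volumes: list of (cy3_volume, cy5_volume, cy2_volume) tuples, one per gel.
    Each sample volume is first divided by the Cy2 standard of the same gel,
    which is what makes values from different gels comparable.
    """
    sample_a = mean(math.log2(cy3 / cy2) for cy3, _, cy2 in spot_volumes)
    sample_b = mean(math.log2(cy5 / cy2) for _, cy5, cy2 in spot_volumes)
    return sample_b - sample_a


if __name__ == "__main__":
    # Two hypothetical replicate gels with different overall intensities.
    gels = [(1000.0, 2000.0, 1500.0), (800.0, 1600.0, 1200.0)]
    print(dige_log2_fold_change(gels))
```

Averaging on the log scale keeps a two-fold increase and a two-fold decrease symmetric around zero, a common but not obligatory design choice.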

The mass spectrometry-based quantitative methods include both label-free quantitation [58, 59] and chemical isotope labelling [60, 61]. Mass spectrometry signals from different liquid chromatography runs are known to be inconsistent, for instance because of technical variation, and can therefore generate significant error in quantitative proteomics studies. Despite this, label-free methods involving liquid chromatography are becoming increasingly prevalent, as they circumvent the need for costly protein labelling and are generally suitable for all types of organisms as well as most workflows [43, 62]. Label-free quantitation compares the chromatographic peak areas of extracted ions. Extracted ion chromatograms exploit the additional separation dimensions for higher confidence in the quantitative signals, instead of simply comparing the mass spectrometry signals between different analytical liquid chromatography runs. In principle, peptide peaks are aligned across several liquid chromatography runs according to their mass-to-charge ratio (m/z) and elution time tags. The chromatographic peaks are then integrated with peak integration software such as Xalign [63] and Msalign [64]. For this to work, the liquid chromatography runs must be reproducible, which can sometimes be challenging. Reiland and co-workers used this approach to determine the dynamic regulation of protein phosphorylation in Arabidopsis [65].
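The idea of an extracted ion chromatogram can be illustrated with a minimal Python sketch that collects the intensity of one target m/z across scans and integrates it over retention time with the trapezoidal rule. It does not reproduce Xalign or Msalign; the input layout, the ppm tolerance and the function name are assumptions made for this example.

```python
def xic_area(scans, target_mz, ppm=10.0, rt_window=(0.0, float("inf"))):
    """Area under the extracted ion chromatogram of one m/z value.

    scans: list of (retention_time, [(mz, intensity), ...]) sorted by retention time.
    ppm: mass tolerance used to decide whether a peak belongs to the target ion.
    rt_window: (start, end) retention time limits of the peak to integrate.
    """
    tolerance = target_mz * ppm / 1e6
    trace = []
    for rt, peaks in scans:
        if rt_window[0] <= rt <= rt_window[1]:
            intensity = sum(i for mz, i in peaks if abs(mz - target_mz) <= tolerance)
            trace.append((rt, intensity))
    # Trapezoidal integration over consecutive points of the trace.
    area = 0.0
    for (rt0, i0), (rt1, i1) in zip(trace, trace[1:]):
        area += 0.5 * (i0 + i1) * (rt1 - rt0)
    return area


if __name__ == "__main__":
    # Hypothetical three-scan run with one unrelated ion at m/z 600.
    run = [
        (0.0, [(500.000, 0.0)]),
        (1.0, [(500.001, 10.0)]),
        (2.0, [(500.000, 0.0), (600.000, 99.0)]),
    ]
    print(xic_area(run, 500.0))
```

Comparing such areas between runs is meaningful only when the same peptide elutes within the same retention time window in each run, which is why run-to-run reproducibility matters.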

Spectral counting is an alternative label-free approach that is practical and measures protein abundance in a semi-quantitative manner [66]. This method neither integrates chromatographic peaks nor aligns the retention times of peptides [67], although its results agree with extracted ion chromatogram peak area measurements [68]. Instead, the total number of tandem mass spectra identified for all the peptides from a particular protein is counted to generate the quantitation data, and statistical tools such as the G-test and t-test are used to evaluate the differences [43]. While this method is reproducible, it requires many biological and technical replicates for each sample analyzed, which can be difficult when several experimental conditions and/or time points are analyzed. This approach has been successfully employed to quantify proteins in several studies in plant systems [69,70,71,72].
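As an illustration of how a G-test can be applied to spectral counts, the following Python sketch compares the counts of one protein in two samples against the counts expected from the total number of identified spectra in each sample. This is one simple formulation (a goodness-of-fit G-test with one degree of freedom); the function name and inputs are hypothetical, and published workflows may normalise counts differently.

```python
import math


def spectral_count_g_test(count_a, count_b, total_a, total_b):
    """G statistic and p-value for one protein observed in two samples.

    count_a, count_b: spectral counts of the protein in samples A and B.
    total_a, total_b: total identified spectra in samples A and B.
    The null hypothesis is that the counts are proportional to the totals.
    """
    pooled = count_a + count_b
    expected = (
        pooled * total_a / (total_a + total_b),
        pooled * total_b / (total_a + total_b),
    )
    observed = (count_a, count_b)
    g = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    g = max(g, 0.0)  # guard against tiny negative rounding errors
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value


if __name__ == "__main__":
    print(spectral_count_g_test(30, 10, 1000, 1000))
```

In practice the test would be repeated for every protein and followed by a multiple-testing correction, which this sketch omits.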

Differential labelling techniques can circumvent these limitations of label-free quantitation involving liquid chromatography. These approaches rely on the assumption that labelled and unlabelled peptides exhibit the same chromatographic and ionization properties but are distinguishable by a mass-shift signature [67]. Specific isotope-labelled amino acids (13C or 15N) [73] in the metabolic protein labelling technique known as stable isotope labelling (SIL) with amino acids in cell culture (SILAC), and chemical labels such as isobaric tags for relative and absolute quantitation (iTRAQ), have been used to quantitate changes in plant proteomes [74, 75]. Labelling methods used in relative quantitation proteomics studies are classed into two categories, depending on whether or not the labels are attached directly to the peptides.

Isotope-coded affinity tag (ICAT) was one of the first differential isotope labelling reagents. It contains a chemically reactive group that binds specifically to cysteinyl residues, an isotope mass tag with light or heavy isotopes, and a biotin tag for affinity purification [76]. Peptide pairs with 8 Da mass shifts are detected during mass spectrometry scans and their ion intensities are compared for relative quantitation. As tagged cysteine-containing peptides are purified by affinity chromatography, the sample complexity is reduced. However, the obvious disadvantage is that only cysteine-containing peptides are captured by the affinity column. This impairs the identification and quantitation of proteins with more than one significant peptide, as about one in seven proteins does not contain cysteine [67]. A study by Majeran and co-workers revealed that non-MS-based (2-DE), ICAT and label-free quantitative techniques are complementary [77]. ICAT has been utilized to determine the localization of Arabidopsis thaliana organelle proteins [78]. In addition, since ICAT labels thiol groups specifically, this method has been widely used to study the redox status of proteins in plants [79, 80].
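Pairing of light and heavy ICAT peptides can be sketched as a search for peaks of equal charge whose m/z spacing matches the label mass difference divided by the charge. The Python sketch below takes the mass difference as a parameter; the default of 8 Da per labelled cysteine is an assumption of this example and should be replaced by the value for the reagent actually used. The function name and peak layout are hypothetical.

```python
def find_icat_pairs(peaks, label_delta=8.0, max_labels=2, tol=0.01):
    """Find light/heavy peptide pairs in a list of (mz, charge, intensity) peaks.

    label_delta: mass difference in Da between heavy and light tags (assumed value).
    max_labels: maximum number of labelled cysteines considered per peptide.
    Returns tuples of (light_mz, heavy_mz, n_labels, heavy_to_light_ratio).
    """
    pairs = []
    for light_mz, light_z, light_i in peaks:
        if light_i <= 0:
            continue
        for heavy_mz, heavy_z, heavy_i in peaks:
            if heavy_z != light_z or heavy_mz <= light_mz:
                continue
            for n_labels in range(1, max_labels + 1):
                expected_spacing = n_labels * label_delta / light_z
                if abs((heavy_mz - light_mz) - expected_spacing) <= tol:
                    pairs.append((light_mz, heavy_mz, n_labels, heavy_i / light_i))
    return pairs


if __name__ == "__main__":
    # Hypothetical doubly charged peaks; the first two form a pair.
    spectrum = [(500.0, 2, 100.0), (504.0, 2, 200.0), (700.0, 2, 50.0)]
    print(find_icat_pairs(spectrum))
```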

iTRAQ was first developed for peptide-level labelling [67]. The difference between iTRAQ and ICAT is that in ICAT, tagged proteins from different samples are pooled before trypsinization to eliminate vial-to-vial variations, whereas in iTRAQ the chemical tags label the peptides instead. The iTRAQ isobaric tags have slight differences in their molecular structures and thus generate different fragment ions (also known as reporter ions) in tandem mass spectrometry scans. The overall mass of the tag is kept constant at 145 Da (iTRAQ-4plex) and 304 Da (iTRAQ-8plex) by the presence of a mass balance group (carbonyl).

The iTRAQ reagents label the peptide N-termini and the amino groups of lysine side chains. The advantage is that the iTRAQ approach allows comparison of four (iTRAQ-4plex) to eight samples (iTRAQ-8plex) in a single experiment. Relative quantitation is ascertained only after peptide fragmentation in MS/MS scans, by measuring the intensity of the reporter ions in the mass region of m/z 114–117 for the 4plex and m/z 113–121 for the 8plex [43, 67]. The iTRAQ method is able to give accurate quantitation spanning two orders of magnitude for low-complexity samples. However, peptide co-fragmentation happens when two or more precursor peptides with closely spaced m/z values are selected for MS/MS instead of a single peptide [81, 82]. The effect of peptide co-fragmentation can be reduced with a high-accuracy mass spectrometer. Tandem mass tags (TMT) are another widely used set of isobaric tags for labelling peptides in relative protein quantitation. As with iTRAQ, the tags share an identical chemical structure but have the stable isotopes 13C and 15N incorporated in different combinations in the mass reporter region. The chemical structure of TMTs enables the introduction of five heavy isotopes in the reporter and balancer groups to generate six isobaric tags. Fragmentation of each of the six tags (of a TMT 6-plex, for instance) gives reporter ions at m/z 126, 127, 128, 129, 130 and 131. TMTs react with the free amino termini of peptides and the epsilon-amino groups of lysine residues [83].
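Reporter ion quantitation can be sketched in Python as summing the MS/MS intensity found near each reporter m/z and expressing every channel relative to a reference channel. The channel positions below are approximate values assumed for an iTRAQ-4plex experiment, and the tolerance, names and spectrum layout are hypothetical; reagent documentation should be consulted for exact masses.

```python
# Approximate reporter ion m/z values assumed for a 4plex experiment.
ITRAQ4_CHANNELS = {"114": 114.11, "115": 115.11, "116": 116.11, "117": 117.11}


def reporter_intensities(spectrum, channels=ITRAQ4_CHANNELS, tol=0.05):
    """Sum the intensity of MS/MS peaks lying within tol of each reporter m/z.

    spectrum: list of (mz, intensity) pairs from one MS/MS scan.
    """
    return {
        name: sum(i for mz, i in spectrum if abs(mz - target) <= tol)
        for name, target in channels.items()
    }


def relative_ratios(intensities, reference="114"):
    """Express every channel relative to the chosen reference channel."""
    ref = intensities[reference]
    if ref <= 0:
        raise ValueError("reference channel has no signal")
    return {name: value / ref for name, value in intensities.items()}


if __name__ == "__main__":
    # Hypothetical MS/MS peak list with four reporter ions and one fragment ion.
    scan = [(114.111, 1000.0), (115.108, 2000.0), (116.112, 500.0),
            (117.115, 1000.0), (175.119, 300.0)]
    print(relative_ratios(reporter_intensities(scan)))
```

Protein-level ratios are usually obtained by combining the ratios of all peptides assigned to a protein, a step this sketch leaves out.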

iTRAQ and TMT reagents have been successfully employed in several quantitative plant proteomics studies. Plant responses towards pathogens have been investigated using this approach [84,85,86,87], as has the signaling role played by trimeric G proteins in plants [88, 89]. Other studies used iTRAQ to investigate the proteomes of grape berries [90] and oil palm mesocarp at different stages of ripening [16]. Quantitative shotgun proteomics using iTRAQ was also employed to characterize the changes in the Arabidopsis phosphoproteome during Pseudomonas syringae pv. tomato DC3000 infection [74]. Meanwhile, TMTs have been used mainly in stress-related studies. TMT quantitative proteomics was used to discover the up-regulation of 63 proteins and the down-regulation of 39 proteins involved in the rice (Oryza sativa) cold-responsive pathway. In another study using TMT, significantly differentially expressed proteins, including abscisic acid-responsive and drought-associated proteins, were found in the rice shoot after root chilling treatment. Liu and co-workers also reported 22 up-regulated proteins involved in the antioxidant defense pathway, cell wall polysaccharide remodeling and cell metabolism in response to copper (Cu) stress in the cell wall of Elsholtzia splendens [91]. Proteome-wide iTRAQ analysis has recently been employed in oil palm studies to reveal differentially expressed proteins involved in important metabolic processes such as fatty acid biosynthesis throughout different fruit developmental stages [15, 16]. Table 2 lists some of the key advantages and disadvantages of the proteomics techniques commonly used in crop proteomics research.

Table 2 Comparison of the advantages and disadvantages of the common proteomics strategies used for oil palm and major crop research

4 Data Mining

Model plants are customarily used to investigate the physiological processes of cells, tissues, organelles or whole organisms. Simplicity of study design, biological relevance and economics have an important impact on the plant models employed [25]. Among the green plants, or Viridiplantae, only 100 species had completed and publicly available genomes in 2014, according to CoGepedia (http://genomevolution.org) and plaBi (http://plabipd.de/) [107]. Presently, there are over 369,000 known species of flowering plants, but the model plants represent only about 0.1% of the known species. The existing plant genomes include the classical A. thaliana (thale cress), economically important crop plants such as Glycine max (soybean), Hordeum vulgare (barley), Medicago truncatula (barrel medic), Populus trichocarpa (poplar), Vitis vinifera (wine grape), O. sativa (rice), Sorghum bicolor (sorghum) and Zea mays (maize), as well as other plants such as Brachypodium distachyon (purple false brome) [108]. However, none of these plant genomes is completely annotated [40], as genome annotation tools remain decidedly lacking [109]. Moreover, only O. sativa, H. vulgare, S. bicolor, Z. mays and B. distachyon are monocotyledons, while the rest are dicotyledons. Given that oil palm is a monocotyledon, those plants are, from a technical standpoint, unlikely to be suitable as model organisms for oil palm proteomics studies.

Complete genome sequences also form the foundation for comprehensive system biology studies by potentially providing a complete parts list of the proteins and RNAs of the studied organism [109]. Encouragingly, a comprehensive genome sequencing project for the two key oil palm species, E. oleifera and E. guineensis, led by the Malaysian Palm Oil Board and St. Louis-based Orion Genomics, USA, was completed in 2010. A total of nearly 35,000 genes were predicted from assembled sequences and transcriptome data of 30 tissue types [7]. Uthaipaisanwong and co-workers also successfully characterized the oil palm chloroplast genome sequence [110]. There are 41,887 non-redundant partial sequences of E. guineensis proteins currently available in the NCBI protein database (as of 16th November, 2017). This information can significantly support oil palm proteomics studies.

Protein identification and characterization with mass spectrometry can also be significantly improved by the availability of expressed sequence tag (EST) sequences [111]. A large collection of 37,743 E. guineensis ESTs has been deposited in the NCBI database. EST databases are indispensable because the sequence tags can be translated in all six reading frames to identify proteins (homology-based) using appropriate software. Nonetheless, the size and quality of EST databases have profound effects on the outcome of protein identification. A common limitation is that most proteins are either not represented or poorly represented by short EST sequences that cover only part of the whole protein sequence. Misread bases and insertion or deletion errors during EST sequencing can lead to high error rates (about 0.3%) in EST sequences, thus reducing the accuracy of peptide matching [112, 113]. Because of these limitations, successful protein identification from peptide mass fingerprints alone is not feasible with EST databases. In addition, EST sequences are rarely sufficient to provide significant protein coverage and a satisfactory number of matching peptides [114].
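Six-frame translation of an EST can be sketched in Python with the standard genetic code: three reading frames are read from the forward strand and three from its reverse complement. The function names and the example sequence are hypothetical, and real EST pipelines add steps such as vector trimming and error handling that are omitted here.

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the six translations keyed by strand and frame (+1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Hypothetical short sequence; stop codons are shown as '*'.
    for frame, protein in six_frame_translation("ATGGCCAAATAG").items():
        print(frame, protein)
```

A single-base insertion or deletion shifts the correct translation into another frame, which is one reason sequencing errors in ESTs reduce the accuracy of peptide matching.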

Reliance on complete plant genomes can be lessened, as annotating the biological function of proteins can be facilitated by a homology-based approach. According to Carpentier et al. [40] and Remmerie et al. [25], there is a need for cross-species analysis. When using other species for protein identification with mass spectrometric data, orthologous sequences are preferable as they are more likely to share similar functions [115]. Homologous sequences originate from a sequence in a common ancestor. They are considered orthologues when they diverged through a speciation (inter-species) event. Paralogous sequences also derive from a common ancestor but arose through a duplication (intra-species) event and are present in the same genome; they may or may not share similar functions [116].

Database-dependent and database-independent strategies are the two approaches used to achieve confident cross-species protein identification. In the former approach, search engines such as Mascot (http://www.matrixscience.com/) are used to search peptide data comprising the precursor ion mass and a list of product ion masses against a taxonomically confined database [117]. Only a large number of generated peptide masses can ensure its success, as this increases the probability of matching several peptides to the homologous protein. This has been demonstrated for pea chloroplast proteins [118] and maize proteins [119]. In a database-independent strategy, fragmentation spectra are used to obtain de novo peptide sequences [120]. MSBLAST, which combines a BLAST search with de novo peptide sequences, has been adapted for tandem mass spectrometry data to increase the accuracy and hit rate of protein identification [121]. TBNovo combines top-down and bottom-up mass spectrometry for de novo peptide sequencing to increase sequence coverage [91]. Essentially, the tandem mass spectrum from a top-down analysis is used as a scaffold to which bottom-up tandem mass spectra are aligned. DeepNovo is another model for de novo peptide sequencing, which is able to perform complete protein sequence assembly without any reference database [19]. Its deep neural network model learns the features of tandem mass spectra, fragment ions and sequence patterns of peptides to perform de novo sequencing. Analyzing modified proteins without completed and annotated genomes has proved to be daunting. The identification of modified peptides and their modification sites is essentially based on single amino acid identifications and becomes largely irrelevant when the peptide sequence is not available in any plant database. A de novo sequencing strategy can facilitate the identification of modified peptides and may even help to locate the modification site, although it requires high-quality tandem mass spectra and preferably fragmentation techniques such as electron capture dissociation or electron transfer dissociation [25, 40].
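The principle behind de novo sequencing, reading residues from the mass differences between consecutive fragment ions of one series, can be shown with a toy Python sketch. It is not an implementation of MSBLAST, TBNovo or DeepNovo; it assumes a clean, complete ladder of singly charged ions of a single type, and the function name, tolerance and example values are hypothetical.

```python
# Monoisotopic residue masses (Da); "L" stands for the isobaric pair Leu/Ile.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ion_mzs, tol=0.02):
    """Infer residues from mass differences between consecutive fragment ions.

    ion_mzs: m/z values of singly charged ions from one series (e.g. b ions).
    Gaps that match no residue within tol are reported as '?'.
    """
    ions = sorted(ion_mzs)
    sequence = []
    for lower, upper in zip(ions, ions[1:]):
        delta = upper - lower
        matches = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tol]
        sequence.append(matches[0] if matches else "?")
    return "".join(sequence)


if __name__ == "__main__":
    # Hypothetical b-ion ladder; the first residue cannot be read from differences.
    print(read_ladder([58.02874, 129.06585, 216.09788, 313.15064]))
```

A modification appears in such a ladder as a gap equal to a residue mass plus the mass of the modifying group, which is how de novo approaches can help to locate modification sites.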

Information on protein localization also helps in understanding the function of proteins and their biological inter-relationships. The Subcellular location database for Arabidopsis proteins (SUBA4) provides the hypothetical localization of many proteins that were identified in various sub-plastidial compartments in A. thaliana [122, 123]. LocSigDB is another database, containing 533 protein subcellular localization signals based on 518 experimentally confirmed and published research works [124]. The localization signals cover eight distinct subcellular locations, mainly in eukaryotic cells, such as ‘Nuclear localization signal’ and ‘Mitochondrial targeting signal’. The Plant Proteome Database (PPDB) was launched in 2004 for A. thaliana and maize (Z. mays) [125]. PPDB was initially developed for plant plastids, but over time the database expanded to cover the entire proteomes of those two plants. The database contains cell type-specific proteomes (maize) and specific sub-organelle proteomes such as those of chloroplasts, thylakoids and nucleoids, as well as whole-leaf proteomes (maize and A. thaliana). More than 16,414 A. thaliana proteins, predominantly from the plastids, have been assigned subcellular locations. Table 3 lists other open-source software available for analyzing large proteomics datasets.

Table 3 Common open-source software to analyze multi-omics dataset

5 Significance of Post-translational Modifications

Almost all proteins are modified in some way following protein biosynthesis. Many physiological responses result from differential protein modifications rather than changes in protein expression levels. These modifications do not create novel proteins but rather new ‘protein species’, since the translated protein sequence remains unaltered [146, 147]. The modifications occur through covalent binding of functional groups such as phosphates, sulphates, carbohydrates and lipids [148]. This event, which is known as PTM, is one of the key mechanisms that change the properties of a protein in cells and greatly enhances the structural diversity and functionality of proteins. This is feasible because PTMs provide a larger repertoire of chemical properties than is possible using the 20 amino acids specified by the genetic code. Protein PTMs can result in alterations in activity, localization, production, interactions with other proteins and half-life [149,150,151]. Modifications are often permanent, but some modifications, such as phosphorylation, are reversible and can be used to switch protein activity ‘on’ and ‘off’ in response to intracellular and extracellular signals. For example, in a signal transduction process, kinase cascades are activated or inactivated through reversible addition and removal of phosphate groups. The esterification of an amino acid side chain through the addition of a phosphate group introduces a strong negative charge, which can subsequently modify the conformation of the protein and alter its stability, activity and potential to interact with other molecules. Genomic sequencing has revealed that 2–3% of all eukaryotic genes probably encode protein kinases [152]. PTM is therefore a dynamic phenomenon with a central role in many biological processes. Generally, in regulatory pathways, the phosphorylation status of serine, threonine and tyrosine residues is regulated by protein kinases and phosphatases [153, 154]. 
Interference with the activities of the kinases and phosphatases indirectly disrupts these regulatory pathways and may cause disease [155, 156].

The complexity of the proteome is increased significantly by PTMs, particularly in eukaryotes where many proteins exist as a heterogeneous mixture of alternative modified forms. Ideally, it would be possible to catalogue the proteome systematically and quantitatively in terms of the types of PTMs that are present, and to specify the modified sites in each case. However, such attempts are complicated by the sheer diversity involved and the transient nature of certain modifications. Every protein could potentially be modified in hundreds of different ways, and might contain multiple modification target sites, allowing different forms of modification to take place either singly or in combination. Thus, most PTMs are still discovered unintentionally when individual proteins, complexes or pathways are studied. It is impossible to predict modifications accurately from the genome sequence. Even when a definitive modification motif is present, it is not necessarily the case that this, or any, modification will take place.

Until recent years, the analysis of PTMs at the proteomics level has received limited consideration due to the lack of appropriate techniques [148]. However, improved separation methods can resolve different post-translational variants, and gels can be stained with reagents that recognize particular types of modified proteins. Mass spectrometry is at present the method of choice to characterize chemical additions and substitutions. Mass spectrometry analysis can be used to identify peptides carrying chemical adducts and can deduce their positions in the protein sequences.

Signaling proteins and regulatory molecules, which play a vital role in cellular function, are typically present in low amounts in the cell. They are also often regulated by phosphorylation. Since the stoichiometry of phosphorylation is usually low, the modified target protein may be present in limiting amounts and may be difficult to detect and quantify. Moreover, even if adequate amounts of a particular variant are available, a larger quantity of sample is required for the full characterization of modifications than for the relatively simple matter of protein identification. Currently, affinity-based techniques are employed to improve the chances of detecting these targets by isolating sub-proteomes with particular types of modification [157].

Investigations into PTMs and differentially expressed proteins are essential to comprehend cellular responses towards changes in environmental conditions [158]. It is clear that plants induce a complex array of pathways and protein phosphorylation cascades during biotic and abiotic stresses [159]. Over 200 possible PTMs have been identified and reported [160,161,162]. To date, more than 90,000 individually modified amino acid residues have been found [163], emphasizing the importance of these PTMs in a functional proteome. Phosphorylation is the most common and most extensively studied PTM using mass spectrometry approaches [30, 148, 149, 164,165,166,167,168,169,170], because it is one of the primary mechanisms in the regulation of cellular processes [171].

Mass spectrometry-based approaches have enabled absolute and relative quantitation of peptides and their PTMs. Internal standard peptides are employed for the absolute quantitation of certain proteins and their defined PTMs [172]. Relative quantitation is performed with either peptide intensity profiling (PIP) or SIL using stable isotope-encoded chemical precursor molecules or alkylating reagents [149, 173].

In an oil palm phosphoproteomics study by Lau et al. [17], 3-enoyl-ACP reductase was deactivated through phosphorylation to direct the metabolic flux towards the production of palmitoyl-ACP during the final phase of fatty acid biosynthesis. Palmitoyl-ACP is a crucial precursor for the biosynthesis of unsaturated oleic acid. Furthermore, the study discovered that the biotin carboxylase subunit of acetyl-CoA carboxylase was also deactivated through phosphorylation during the same phase. The deactivation would have stopped the production of malonyl-ACP, which is the carbon precursor for the initial stage of fatty acid biosynthesis in the oil palm.

5.1 Tracking the Phosphopeptides

A wide array of approaches can be used to scrutinize phosphorylation changes in cells or tissues. Radiolabeling is a classical technique that uses radiolabeled 32P-orthophosphate to detect phosphoproteins. However, radioactivity can be very inconvenient, harmful and detrimental in the long term, both to the users and samples [174, 175]. Alternatively, after separation by two-dimensional gel electrophoresis [176], phosphoproteins can be directly visualized on the gel using phosphospecific fluorescent stains and phosphospecific antibodies, which are non-radioactive [52, 53, 56, 177, 178]. Immuno- or Western blotting is the most common method used to assess the phosphorylation state of a protein; proteins transferred from a one-dimensional or two-dimensional electrophoresis gel are probed with phosphospecific antibodies (for phosphorylated tyrosine, serine and threonine) [179, 180]. In direct staining, phosphospecific stains such as the fluorescent phosphosensor dye Pro-Q Diamond (Invitrogen) bind directly to the phosphate moiety of phosphoproteins [56, 57, 181]. The advantages of this stain lie in its compatibility with other staining methods and with the ensuing mass spectrometry analysis. This is particularly crucial when trypsinization is performed directly on the gel pieces. A similar phospho-specific staining kit called Phos-tag, in which a Zn2+ ion chelator with high selectivity is coupled to a fluorophore, has also been used [182]. The suitability of these stains for phosphoproteome analysis has been described in previous reports. Agrawal and Thelen [55] identified 70 non-redundant phosphoproteins that belonged to the major functional classes from a Pro-Q Diamond stained two-dimensional gel containing rapeseed (Brassica napus) proteins. However, while phosphoproteins could be detected, the stains would not indicate the phospho-sites, which are vital in the characterization of phosphorylation events. 
Special techniques are used to investigate membrane phosphoproteins because of the limitations of two-dimensional gel electrophoresis. Integral membrane proteins tend to aggregate during isoelectric focusing and thus cannot be separated in the second dimension of two-dimensional gel electrophoresis.

The low abundance of phosphorylated proteins in cellular extracts and their relatively low degree of phosphorylation pose major challenges [183,184,185]. In mass spectrometric analysis, non-phosphorylated peptides often compete with the phosphorylated peptides for ionization. As a result, many phosphoprotein peaks are difficult to detect, either because they have a low signal-to-noise ratio or because they are not ionized at all. To tackle this obstacle, enrichment techniques are commonly applied prior to separation by liquid chromatography. Immobilized metal affinity chromatography (IMAC) is one of the methods used to enrich phosphopeptides from complex mixtures, based on the affinity of positively charged metal ions (Fe3+, Al3+, Ga3+ or Co2+) towards phosphate moieties. Iminodiacetate and nitrilotriacetate (NTA) are the prototypical metal-binding ligands used in IMAC stationary phases [186,187,188,189]. The Fe(III)–NTA complex is perhaps the most frequently utilized to enrich phosphopeptides, although the use of other metal–ligand complexes has also been reported [190, 191]. More recently, Zr(IV)–phosphonate immobilized on various stationary phases has also been employed for phosphopeptide enrichment by several groups [192,193,194,195,196]. The phosphopeptides can be eluted using different salt and/or pH gradients prior to mass spectrometry analysis. Nonetheless, several challenges arise when using IMAC. These include leaching of ions from the column during enrichment steps, non-specific binding of peptides that contain the acidic amino acids glutamic and aspartic acid, and higher specificity for multiply phosphorylated peptides [186].

Metal oxide affinity chromatography (MOAC) is another valuable technique to isolate phosphopeptides from complex mixtures with high selectivity and recovery [186, 197,198,199]. The metal oxides are often more stable at high temperatures and over a broad pH range [200]. Titanium dioxide (TiO2) is the most popular metal oxide resin used as a selective affinity support to capture phosphorylated peptides [201,202,203,204,205,206]. At acidic pH, TiO2 has a positively charged surface [207] that permits very selective enrichment of phosphopeptides from complex samples through the affinity of their phosphate groups toward porous TiO2 particles (Titansphere) [208]. Water-soluble phosphates are desorbed under alkaline conditions. Strong cation exchange and titanium dioxide-type columns have both been used for phosphopeptide enrichment, together with SILAC for quantitation, to study phosphorylation changes [209].

Technical variations and bias in quantitative analyses are often reported to occur after phosphopeptide enrichment [43]. Nonetheless, successful identification and quantitation of phosphopeptides has been reported using a combination of enrichment strategies and label-free quantitation, as in the case of Arabidopsis phosphopeptides from a plasma membrane fraction following sucrose treatment [58] and a hypersensitive response study in tomato plants [59]. In addition, iTRAQ labelling has been successfully used to quantify phosphorylated peptides in Arabidopsis cells during their defense response to induction by P. syringae elicitors [74]. As a rule of thumb, it is more effective to perform chemical labelling prior to any enrichment strategy, because enrichment steps introduce technical bias into quantitative analyses [43].

Hydrophilic interaction liquid chromatography (HILIC) can also be used, in addition to MudPIT LC, in the pre-separation of peptides prior to phosphopeptide enrichment by IMAC or TiO2 affinity purification. HILIC separates polar biomolecules through their binding to a neutral, hydrophilic stationary phase via hydrogen bonds. These bonds can be broken by reducing the organic composition of the mobile phase, so that the peptides are eluted according to their polarities [210].

PTM occurrences can also be detected through neutral loss-triggered tandem mass spectrometry (NLMS3) and SRM approaches. In a phosphoproteomics study, the phosphate group of a phosphopeptide is relatively labile and tends to break away during collision-induced fragmentation, in the form of metaphosphoric acid (HPO3) or phosphoric acid (H3PO4). Hence, the fragmentation of phosphoamino residue-containing (serine, threonine and tyrosine) precursor ions generates neutral losses of 80 Da (HPO3) or 98 Da (H3PO4) [211]. Usually, the loss of H3PO4 from phosphoserine- and phosphothreonine-containing peptide ions is the result of a β-elimination reaction [212]. In a β-elimination reaction, the hydrogen atom on the α-carbon of the phosphorylated amino acid residue is transferred to the phosphate oxygen. As a result, H3PO4 and product ions containing dehydroalanine (69 Da) or dehydroaminobutyric acid (83 Da) are produced from phosphorylated serine or threonine residues, respectively. Loss of H3PO4 is more dominant for serine-phosphorylated peptides and occurs to a lesser extent in threonine-phosphorylated peptides. This might be caused by the steric hindrance of the β-methyl group in the side chain of threonine [213]. Tyrosine-phosphorylated peptides show a much lower extent of neutral loss, which occurs in the form of HPO3. Mass spectrometry-based strategies such as NLMS3 and SRM are essentially built on the detection of the characteristic neutral loss generated during the CID fragmentation of phosphoamino-containing peptides. In the NLMS3 mode of operation, the diagnostic neutral loss of H3PO4 (98 Da) from the precursor ion in a tandem MS scan automatically triggers the MS3 fragmentation of the neutral loss precursor ion. 
The aim of the MS3 is to compensate for the lack of sequence-specific information in the MS2 spectra of phosphorylation-modified peptides [213], although a study by Villen and co-workers indicated that the collection of MS3 scans did not improve the informative spectra of the peptides identified [214]. MS3 operates in a data-dependent manner, in which the MS3 is triggered by the presence of an intense product ion peak with the mass of a neutral loss. This strategy has been extended to detect phosphopeptides in the oil palm mesocarp by Lau et al. [17]. The neutral loss species from the product ions were calculated on the basis of the product ion mass and charge state, resulting in neutral losses of m/z 48.99 or m/z 32.66 relative to the doubly or triply charged phosphorylated product ion, respectively. A common problem with a neutral loss scan to detect phosphopeptides is that unassigned peptides may also generate ions with a mass similar to the neutral losses [214, 215]. There are also instances when a phosphopeptide fails to generate the specified neutral loss and is therefore not detected [216].
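
The neutral loss values quoted above follow from dividing the mass of the lost group by the charge state of the ion. The short Python sketch below reproduces this arithmetic and the residue masses left after β-elimination. The monoisotopic masses are standard values, while the function names are chosen here purely for illustration and are not taken from any of the cited studies.

```python
H3PO4 = 97.97690  # monoisotopic mass of phosphoric acid (Da)
HPO3 = 79.96633   # monoisotopic mass of metaphosphoric acid (Da)

# Monoisotopic residue masses (Da) of unmodified serine and threonine.
SER = 87.03203
THR = 101.04768


def neutral_loss_mz(loss_mass, charge):
    """m/z shift observed when a neutral group is lost from an ion of given charge."""
    return loss_mass / charge


def product_after_loss(ion_mz, charge, loss_mass=H3PO4):
    """m/z of the ion after the neutral loss; the charge state is unchanged."""
    return ion_mz - neutral_loss_mz(loss_mass, charge)


def beta_eliminated_residue(residue_mass):
    """Residue mass after phosphorylation (+HPO3) followed by loss of H3PO4."""
    return residue_mass + HPO3 - H3PO4


if __name__ == "__main__":
    for z in (1, 2, 3):
        print(z, round(neutral_loss_mz(H3PO4, z), 2), round(neutral_loss_mz(HPO3, z), 2))
    print(round(beta_eliminated_residue(SER), 3))  # dehydroalanine residue
    print(round(beta_eliminated_residue(THR), 3))  # dehydroaminobutyric acid residue
```

For a doubly charged ion the loss of H3PO4 therefore appears as a shift of about 48.99 in m/z, and for a triply charged ion as a shift of about 32.66.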

SRM is an ideal complementary technique to reliably target and quantitate low-abundance phosphopeptides of interest [217,218,219,220,221]. SRM is predominantly performed on a triple quadrupole mass spectrometer because the additional mass filter (third quadrupole, Q3) is exploited to isolate the targeted fragment ion for MS2. However, in the study by Lau and co-workers, a quadrupole-TOF was used instead to scan the targeted precursor ion for the loss of the neutral loss species (98 Da) [17]. The absence of the Q3 mass filter in the quadrupole-TOF implies that only the neutral loss species at 98 Da, corresponding to the loss of H3PO4, can be detected after collision-induced fragmentation of the selected precursor ions in Q2. The ideal prerequisites for targeted SRM experiments are prior knowledge of the primary sequence, type of phosphorylation, sequence motif and predicted fragmentation pathways to identify the potential phosphopeptides (for example, a neutral loss of 98 Da from phosphoserine and phosphothreonine peptides but not from phosphotyrosine peptides).

5.2 Prediction of Post-translational Modifications

Automated prediction of PTM sites is one of the main areas of interest in bioinformatics. In vivo and in vitro determinations of modified proteins and their PTM sites are not only time-consuming and tedious, but are often restricted by the availability and optimization of the enzymatic reactions needed to determine the types and sites of modification [222,223,224]. Tandem mass spectrometry spectra offer the most informative fingerprints of modified peptides. The spectra encode not only peptide sequences, but also the masses and sequence positions of modifications. For these reasons, computational techniques have been employed to manage the massive amounts of fragmentation spectra and to determine modified proteins and individual PTM sites with high accuracy and efficiency [225]. The current PTM prediction tools are broadly classed into four major groups based on their classification schemes [226].

The first group comprises general PTM related resources such as PROSITE [227] which predicts types of PTMs based on their sequence pattern consensus. Several signature recognition methods are combined to probe a query protein sequence against observed protein signatures. The Scansite tool predicts kinase-specific and signal transduction relevant motifs [228]. Conserved sequence motifs represent imprints of important biochemical properties or biological functions of those proteins.
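
Consensus-pattern scanning of the kind described above can be sketched in a few lines of Python. The example converts a small subset of the PROSITE pattern syntax into a regular expression and reports every match in a query sequence. The two patterns used, [ST]-x-[RK] and [ST]-x(2)-[DE], are the commonly quoted consensus motifs for protein kinase C and casein kinase II phosphorylation sites; the function names and the test sequence are hypothetical, and the converter deliberately ignores parts of the syntax such as sequence anchors.

```python
import re


def prosite_to_regex(pattern):
    """Convert a subset of PROSITE pattern syntax to a Python regular expression.

    Supported: residue sets [ST], any residue x, exclusions {P},
    and repeats such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."  # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except those listed
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return (1-based start position, matched motif) for all matches, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    query = "MASLRKSPTEE"  # hypothetical sequence for illustration
    print(find_sites(query, "[ST]-x-[RK]"))
    print(find_sites(query, "[ST]-x(2)-[DE]"))
```

Short motifs of this kind match frequently by chance, which is consistent with the observation that the presence of a motif does not guarantee that the modification actually takes place.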

The second group consists of various neural network prediction tools. These tools cover phosphorylation related prediction servers such as NetPhos [229] and NetPhosK [222, 230]. NetPhosK is the most popular since the server allows a preferred ‘threshold’ value to be indicated during prediction.

The third group of prediction tools encompasses different support vector machine-based prediction techniques. These methods are constructed on the basis that the residues adjacent to the phospho-sites represent the main determinant of kinase specificity [224, 231]. For instance, PredPhospho [232] aims to predict phosphorylation sites and the type of kinase that acts at each site. AutoMotifServer [233] also predicts PTM sites in protein sequences using a support vector machine classifier with both linear and polynomial kernels. KinasePhos 2.0 is a web server that identifies protein kinase-specific phosphorylation sites based on amino acid residue sequences and coupling patterns [234]. PHOSIDA is also capable of predicting phosphosites [235].

The final group consists of the remaining types of machine learning-based PTM prediction tools. Prediction of PK-specific phosphorylation site (PPSP) adopts Bayesian decision theory to predict kinase-specific phosphorylation sites and has been reported to produce precise predictions of the probable phosphorylation sites for about 70 protein kinase groups [226]. The PPSP_balanced model worked remarkably well for all types of protein families. Ascore, a probability-based score developed by Beausoleil and colleagues, uses the presence and intensities of site-determining ions in tandem mass spectrometry spectra to calculate the probability of exact phosphorylation site localization [236]. Ascore re-evaluates the phosphopeptide results from a search engine and assigns a confidence value to each phosphorylated site. PhosphoScore is another algorithm which acts similarly to Ascore. PhosphoScore considers both the match quality and the normalized intensity of observed spectral peaks compared to a theoretical spectrum. PhosphoScore was employed successfully in the studies done by Ruttenberg et al. [237]. PTMap is a sequence alignment software used to identify protein PTMs and polymorphisms [238]. Peak selection, adjustment of inaccurate mass shifts and precise localization of PTM sites are the features that improve the searching speed and accuracy of PTMap. This software is the first algorithm to include a scoring system that concentrates on unmatched peaks to eliminate false positives, thus increasing the accuracy and sensitivity of PTM identifications. Table 4 summarizes some of the main in silico tools to study several common PTMs.
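
The idea of scoring site localization from site-determining ions can be illustrated with a simplified binomial model. The Python sketch below is an assumption-laden illustration and not the published Ascore implementation: it treats each of n possible site-determining ions as matching by chance with probability p, converts the cumulative binomial probability of observing at least k matches into a score, and compares the two best candidate sites. The function names and the default value of p are hypothetical, and peak intensities are not considered.

```python
from math import comb, log10


def binomial_score(n_ions, n_matched, p):
    """Score = -10 * log10 of the chance of matching at least n_matched of n_ions peaks."""
    probability = sum(
        comb(n_ions, i) * p**i * (1 - p) ** (n_ions - i)
        for i in range(n_matched, n_ions + 1)
    )
    return -10 * log10(probability)


def localization_delta(n_ions, matched_best, matched_second, p=0.05):
    """Difference between the scores of the best and second-best candidate sites.

    A larger value indicates that the spectrum discriminates more clearly
    between the two candidate positions.
    """
    return binomial_score(n_ions, matched_best, p) - binomial_score(n_ions, matched_second, p)


if __name__ == "__main__":
    # Hypothetical case: 8 site-determining ions, 6 matched for site A and 2 for site B.
    print(round(localization_delta(8, 6, 2), 1))
```

When both candidate sites explain the same number of site-determining ions the difference is zero, which corresponds to a phosphopeptide that is identified but whose modification site remains ambiguous.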

Table 4 In silico tools to study some of the common post-translational modifications

In addition to these software tools, numerous databases provide information on PTMs. The dbPTM database [251] gathers various information such as catalytic sites, protein domains and protein variations, and includes a majority of the experimentally validated PTM sites from SwissProt and Phospho.ELM. Phospho.ELM comprises over 40,000 non-redundant serine, threonine and tyrosine phosphorylation sites from vertebrates, Drosophila melanogaster and Caenorhabditis elegans [252]. Similarly, PHOSIDA (http://www.phosida.com) comprises more than 80,000 phosphorylated, N-glycosylated or acetylated sites from nine different species [235]. For each of the phosphosites, PHOSIDA lists matching kinase motifs, predicted secondary structures, conservation patterns and dynamic regulation upon stimulus. Unfortunately, none of these species is a plant. PhosphoSitePlus (http://www.phosphosite.org) has 130,000 non-redundant modification sites, primarily for phosphorylation, ubiquitination and acetylation [253].

6 Conclusions

In the post-genomics era of improvement for perennial oil crops such as oil palm, it is crucial to first map the entire set of proteins using the emerging proteomics technologies. Given the lack of protein-level information on the oil palm genes sequenced so far, a systematic effort using the tools of proteomics is essential to elucidate the biological functions of interest based on these genomic sequences. It is important to understand, through proteome-wide protein quantitation, the key controlling mechanisms of metabolic processes such as fatty acid production and plant defense against pathogens. Subsequent PTM analysis and protein–protein interaction mapping can eventually help to predict the regulatory networks under different planting environments. This information is crucial to strategize breeding programs and to discover biologically significant markers for oil palm fruit growth and development, to improve yield and quality, and to enhance plant immunity towards various environmental stresses, in particular the diseases that have obstructed the optimal production of palm oil.