
1.1 Introduction

Bioinformatics is a multidisciplinary field that emerged from the interaction of information science, statistics, and the biological sciences. It is used to analyze genome and proteome contents and sequence information and to predict the function and structure of cellular molecules, which in turn supports the interpretation of genomic and proteomic information from agricultural organisms (Benton 1996; Bruhn and Jennings 2007). Although relatively new, bioinformatics is a significant discipline within the biological sciences that has enabled scientists and agrobiologists to interpret and handle huge amounts of information (Bartlett et al. 2016). The volume of data being produced has in turn driven the advancement and development of bioinformatics. The multi-“omics,” together with computational biology, are considered important tools for understanding genomes and their products, which drive many animal, plant, and microbial functions (Mochida and Shinozaki 2011). Functional analysis of these organisms includes profiling of gene products, prediction of protein interactions and subcellular locations, and simulation of protein metabolic pathways (Xiong 2006). Bioinformatics is not an isolated tool; it frequently interacts with other biological disciplines to produce integrated results. For example, predicting the structure of a protein depends on gene sequence and gene expression profiles, which require the use of phylogeny tools in sequence analysis. The field has therefore developed so that its most important task is now the interpretation of different types of information, including nucleotide and protein sequences and protein structure and function (Moorthie et al. 2013; Merrill et al. 2006). The analysis of DNA/RNA sequences, protein sequences and function, genomes and gene expression, and protein involvement in physiological functions all rely on bioinformatic methods and tools and cannot be done without them (Collins et al. 2003). Protein sequence information and the related nucleic acid data from many agricultural species provide a basis for agricultural research, leading to a better understanding of global agricultural needs and challenges (Kumar et al. 2015). Using the available information helps to identify the expression of genes, which may clarify the relationship between phenotype and genotype (Orgogozo et al. 2015). The use of proteomic applications for the analysis of crops, animals, and microorganisms has increased rapidly within the last decade (Mochida and Shinozaki 2010). Although proteomic approaches are regularly used in plant research worldwide and are powerful tools, there is still significant room for improvement.

Proteoinformatics can be defined as the “utilization of computational biology tools in the study of the proteome.” It is a field that combines mathematics, programming, statistics, and protein biology and biochemistry to predict and analyze the structure and function of proteins and their role in cell physiology (Cristoni and Mazzuca 2011; Hamady et al. 2005). Because the data obtained from agricultural proteomic research are complex and massive, proteoinformatics is essential for reducing investigation time and delivering statistically significant results that help to improve plant and animal quality in terms of healthy growth and high productivity. Proteoinformatics is thus a dynamic field for developing diagnostic tools for new breeds, with the aim of achieving pathogen-free or pathogen-resistant stock, abiotic stress tolerance, high-quality traits, and higher yields (Koltai and Volpin 2003).

1.2 Proteoinformatics in Plant Disease Management

Of the different plant pathogens, which include viruses, bacteria, oomycetes, and fungi, the fungi are considered the most destructive (Dangl and Jones 2001). The growth, propagation, and survival strategies of pathogens vary, but the general course is similar: it starts with colonization, progresses to overcoming the host defense system, and ends with the establishment of infection (Pegg 1981; Lawrence et al. 2016). As a result, host-pathogen systems involve a complex relationship between host and pathogen molecules, with a high degree of variation (Hily et al. 2014). Proteomic studies have focused mainly on the response of the host plant to pathogen attack, which opened up a new era for biology in general and for agriculture in particular (Lodha et al. 2013; Alexander and Cilia 2016). Along with the use of proteomic approaches in agricultural research and the progress in sequencing agriculturally important organisms, the combination of bioinformatics and proteomics generally enhances research in this area. This kind of multidisciplinary research is likely to close the gap in understanding the host-pathogen interaction network (Koltai and Volpin 2003). Two-dimensional gel electrophoresis was initially used for rapidly identifying major proteome differences between control and inoculated plants. Although many proteins identified during host-pathogen interactions have been highlighted, the majority were previously known and are mainly involved in host immunity mechanisms (Memišević et al. 2013). Nevertheless, the results of proteomics-based research are of great significance for validating gene expression in genomic or transcriptomic studies (Nesvizhskii 2014). Even so, gel-based proteomic tools have yielded little novel information, largely because of the lack of sufficient bioinformatics-related information such as genome sequences (Cho 2007).
Indeed, only the most abundant proteins are detected in two-dimensional gels and successfully identified by mass spectrometry (MS). A gap therefore appears to exist in the bioinformatics pipeline for proteomics research on organisms without complete genome sequences (Sheynkman et al. 2016). These information-related limitations in agricultural proteomic research need to be overcome to increase our knowledge of protein expression during plant-microbe interactions. Proteomic tools have nonetheless grown rapidly, and new approaches and instruments are being developed (Mehta et al. 2008; Pérez-Clemente et al. 2013). Previous agricultural proteomics research, which focused mainly on model crops, has provided fundamental insights into the modification and regulation of different protein families in agricultural organisms (Hu et al. 2015; Vanderschuren et al. 2013). Nonetheless, model crop research alone does not capture all the information and data of interest to agricultural biology (Mirzaei et al. 2016; Carpentier et al. 2008). Crops without a complete genome sequence or sufficient freely available genomic/EST information therefore need to be investigated (Ke et al. 2015; Ekblom and Wolf 2014). In comparison with the model organisms related to agriculture, such as rice (Koller et al. 2002), maize (Pechanova et al. 2010), chicken (Burgess 2004), cattle (Assumpcao et al. 2005), brewer’s yeast (Khoa Pham and Wright 2007), and the plant pathogen Botrytis cinerea (Fernández-Acero et al. 2009), non-model species with little or no “bioinformation” have been at a considerable disadvantage in proteomic analysis (Armengaud et al. 2014). Economic significance and genome complexity can make it necessary to sequence an organism (Bolger et al. 2014), but sequencing alone is not enough to make it a model organism if the information is not accessible to the scientific community (Canovas et al. 2004). Table 1.1 lists proteomic studies of non-model organisms.
Most mass spectrometry-based proteomic methods depend on a complete sequence for identification; for that reason, the analysis of non-model species remains a challenge. Relying on a complete and comprehensive established database for a closely related model species, that is, on genome regions conserved within the family, is then the only choice (Hutchins 2014; Zhu et al. 2017; Bischoff et al. 2016). However, sequence variation remains an issue, especially for quantitative proteomics approaches, and leads to low coverage of protein identification (Chandramouli and Qian 2009; Zhan et al. 2017). Moreover, conserved genome regions may produce similar protein sequences with different cellular functions and may increase the number of mismatched protein identities (Khan et al. 2014). Gel-based proteomics is considered the dominant platform for agricultural proteomic research (Tan et al. 2017). However, the use of gel-free proteome analysis is increasing rapidly in agricultural research as more proteoinformatics data become available (Porteus et al. 2011; Komatsu et al. 2013). Pathogen proteins that suppress host defenses are of high importance in agricultural host-pathogen interactions, as these proteins may act as effector molecules and play a role in virulence and pathogenicity (Van De Wouw and Howlett 2011). Pathogen characteristics are of primary interest in crop development programs (Fletcher et al. 2006). Proteoinformatic advances have contributed to the sequencing of the entire genomes of many pathogens in the last 10 years (Land et al. 2015). Research using classical biochemistry and molecular biology, as well as modern omic platforms coupled with bioinformatic tools, has been conducted on agriculture-related pathogens and their interactions with crops (Barah and Bones 2014).
Recently, the study of pathogens has been significantly promoted by the availability of bioinformatic data and resources for multi-“omics” research (Bhadauria 2016). These approaches, in combination with gene-targeting studies such as targeted mutation and gene silencing, help to explain molecular host-pathogen communication and the complex mechanisms involved in pathogenesis and virulence (Allahverdiyeva et al. 2015; McGarvey et al. 2009; Fondi and Liò 2015). The present efforts to provide sufficient “proteoinformation” to determine related proteins and their functions have improved the capacity to understand the core causes of crop and animal diseases and to develop new treatment possibilities (Chen et al. 2010). Proteoinformatics has many practical applications in current agricultural disease management: the study of host-pathogen interactions and an understanding of disease genetics and of the pathogenicity and/or virulence factors that drive the infection process, which eventually aid in designing better disease control. Such factors have also been identified using molecular biological technologies and genetics in interactions with bacteria, such as tomato and Pseudomonas syringae (Parker et al. 2013b; Balmant et al. 2015) and rice and Xanthomonas oryzae (Wang et al. 2013b); with viruses, such as potato and potato virus Y (PVY) (Stare et al. 2017); and with phytopathogenic fungi, such as apple and Alternaria alternata (Zhang et al. 2015), strawberry and Fusarium oxysporum (Fang et al. 2013), cotton and the rot fungus Thielaviopsis basicola (Coumans et al. 2009), and coffee and Hemileia vastatrix. Proteoinformatic tools and databases related to agricultural diseases need to be further developed and expanded.
Many tools, software programs, and databases are adapted from human, and more specifically medical, analysis systems, and these are not necessarily suitable models for the analysis of crop proteomic data. More information on these crops and their pathogens would therefore help to fill the proteoinformation gap in agricultural research and to verify the protein information predicted in the literature (Dennis et al. 2008; Thrall et al. 2011; Van Emon 2016). In general, proteoinformation is larger and more complicated than genoinformation, especially in crops, since there are more proteins than genes, mainly because of post-translational enzymatic modification. A nucleotide sequence can represent the genome of an organism; peptide sequences, by contrast, cannot represent the proteome of that organism unless the structures of and interactions between those proteins are revealed (Gupta et al. 2007; Khan 2015).

Table 1.1 Proteomic studies on non-model-pathogen interaction (2008–2018)

1.3 Proteoinformatic Databases and Tools

Genome sequencing projects for crops and animals related to agriculture are increasing the number of proteomic studies in this field. Proteoinformatic methods and tools can be used to identify a specific protein of interest within the proteome of an organism, which could be valuable to the agricultural community, and to interpret its cellular functions. Such distinctive protein information might be used to develop drought- and salt-tolerant crops, to improve disease resistance and livestock, and to raise productivity (Fears 2007; Gong et al. 2015; Ahmad et al. 2016). As discussed, a closely related sequence can be used for a specific crop or animal if its genome information is not accessible. The ever-growing databases of whole genome sequences continue to accelerate the capabilities of proteoinformatics. At the time of writing this chapter, there are more than 500 plants with whole genome sequences, out of more than 5000 eukaryotic genome sequences, since the first plant genome sequence (Arabidopsis) in 2000 (Kaul et al. 2000). Bioinformatic investigations of genome-based information from important commercial crops revealed that gene organization has remained conserved over evolutionary time, which means that knowledge obtained from model plants such as Oryza sativa and Arabidopsis thaliana may be exploited to propose food improvement programs for monocot and dicot crops, respectively (Ong et al. 2016; Jayaswal et al. 2017).

In proteoinformatics, working with a “peptide/protein sequence” implies analyzing that sequence computationally, against related databases or with other bioinformatic methods. Sequence alignment in proteoinformatics is the arrangement of protein/peptide, RNA, or DNA sequences to find similar regions that may indicate a functional or structural relationship (Pearson 2013). Some important proteoinformatics databases are listed in Table 1.2.
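To make the idea of sequence alignment concrete, the following is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch scheme). The scoring values (match +1, mismatch -1, gap -2) and the function name are illustrative assumptions; practical tools typically use amino acid substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score of sequences a and b.

    Scoring values are illustrative assumptions, not those of any
    particular alignment tool.
    """
    # prev[j] is the best score for aligning the processed prefix of a
    # with the first j residues of b; the initial row is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i residues of a aligned against gaps only
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two residues
                            prev[j] + gap,        # residue of a against a gap
                            curr[j - 1] + gap))   # residue of b against a gap
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Two hypothetical peptides differing by one substitution.
    print(global_alignment_score("MKTAYIAK", "MKTVYIAK"))
```

A higher score indicates a larger similar region between the two sequences, which is the signal used to suggest a functional or structural relationship.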

Table 1.2 Proteoinformatics online databases/resources

Proteoinformatics is considered an evolving field of agricultural research. Interpreting particular functions in crops and animals is essential for identifying proteins that are useful for improving agricultural traits (Newell-McGloughlin 2008). Integrating proteoinformatics with databases from other omic platforms for agricultural species is highly important for enhancing crop and animal systems and addressing global issues such as food supply, water stress, and climate change (Katam et al. 2015a). For Asia, for instance, the Asia Pacific Bioinformatics Network (www.apbionet.org) is a good regional resource (Khan et al. 2013).

Besides the classical, well-known databases, many web-based databases and platforms have served proteomics and have been used in agricultural research. The ExPASy Proteomics site, for instance, is considered a tool developed for human proteomic research (Gasteiger et al. 2003; Hoogland et al. 2007); however, it is widely used to compute the isoelectric point (pI) and molecular weight (Mw) in agricultural proteomic studies (Imam et al. 2014; Dahal et al. 2010; Guijun et al. 2006; Schneider et al. 2004; Lande et al. 2017). In general, a number of web-based proteomics databases hold plenty of useful information for agricultural proteomics (Martens 2011). Recently, a new website was developed for tracking information and articles related to changes in plant proteomes in response to stress (PlantPReS; www.proteome.ir). Organelle proteomic data for animals and plants have also been collected in databases such as Organelle DB (http://labs.mcdb.lsa.umich.edu/organelledb/) (Agrawal et al. 2011). Organelle proteomics, which focuses on subcellular proteins rather than total proteins, is considered a successful approach (Yates III et al. 2005); examples include proteome research on mitochondria in potato (Salvato et al. 2014), chloroplasts in tomato (Tamburino et al. 2017), endoplasmic reticulum in rice (Qian et al. 2015), peroxisomes in spinach (Babujee et al. 2010), vacuoles in cauliflower (Schmidt et al. 2007), and nuclei in soybean (Cooper et al. 2011). Because organelles contain a limited number of proteins, protein identification is more tractable. For the last 30 years, gel-based proteomics has been the main platform for agricultural proteomics. The gel is stained to visualize the proteins that have migrated to specific locations in the gel. For complex samples, proteins are analyzed after enzymatic digestion (Padula et al. 2017).
Many software programs have been developed for gel analysis (single-stain and 2D-DIGE) and used in agricultural proteomic research. Most are commercial, such as Delta2D (http://www.decodon.com/delta2d.html), ImageMaster 2D Platinum, Melanie 9 (http://2d-gel-analysis.com/), PDQuest (http://www.bio-rad.com/en-ch/product/pdquest-2-d-analysis-software), Samspots, SpotsQuest and SpotMap (http://www.cleaverscientific.com), and Dymension (http://www.syngene.com/dymension). Some freely available programs have not survived and are either no longer available for download or discontinued altogether, such as Gel IQ (http://ludesi.com/); a few are still available and functioning (Maurer 2016; Singh 2015), such as Gel2DE, SDA for DIGE analysis, and RegStatGel (http://www.mediafire.com/FengLi/2DGelsoftware).
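The molecular weight (Mw) computation mentioned above for web tools such as ExPASy can be illustrated with a short sketch. It sums standard average residue masses and adds one water molecule for the free termini; the function and variable names are hypothetical, and this is not a reproduction of the ExPASy implementation. The isoelectric point is not computed here, because it depends on the chosen set of pK values.

```python
# Average residue masses (Da) of the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_AVERAGE_MASS = 18.01528  # added once for the N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Return the average molecular weight (Da) of an unmodified protein."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = sorted(set(seq) - set(AVERAGE_RESIDUE_MASS))
    if unknown:
        raise ValueError(f"unsupported residue(s): {', '.join(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER_AVERAGE_MASS


if __name__ == "__main__":
    # Hypothetical peptide, used only to demonstrate the call.
    print(round(molecular_weight("MKTAYIAK"), 2))
```

Post-translational modifications, which the chapter notes are common in crop proteomes, would shift the observed mass away from this theoretical value.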

Following protein separation, the peptide MS/MS fragmentation spectra are matched against the sequences available in the database for protein identification. Peptide sequences are identified on the basis of a similarity score between the experimental and theoretical MS/MS spectra: the mass spectra obtained are matched with the theoretical ones derived from the database, and a statistical score, based on spectrum resemblance, is associated with each protein identification. The limitation of this approach is that only known proteins/genes reported in the database can be identified (Nilsson et al. 2010). Recently, NCBI dropped the “gi number” identifier and replaced the NCBInr database with a newer database named NCBIProt, which is more complicated yet more comprehensive (Disruption ahead for NCBI databases 2016). The only disadvantage of this new database is that searches for non-model organisms are time-consuming, although a slight improvement was noticed (data not shown). De novo sequencing can be the method of choice when the protein is not in the database; in this case, the sequence is obtained directly from the MS/MS spectra, skipping the database spectrum search. The resulting sequences are then compared with those contained in the database to detect homologies (Ekblom and Wolf 2014).
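The matching of experimental against theoretical MS/MS spectra can be sketched as follows. The example generates singly charged b and y fragment ions for candidate peptides from monoisotopic residue masses and ranks the candidates by a simple shared-peak count. The shared-peak count, the 0.02 Da tolerance, and all function names are simplifying assumptions; search engines use more elaborate statistical scores.

```python
from typing import Iterable, List

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def theoretical_ions(peptide: str) -> List[float]:
    """Return m/z values of singly charged b and y ions of a peptide."""
    masses = [MONOISOTOPIC_RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for cut in range(1, len(masses)):
        ions.append(sum(masses[:cut]) + PROTON)          # b ion (N-terminal)
        ions.append(sum(masses[cut:]) + WATER + PROTON)  # y ion (C-terminal)
    return ions


def shared_peak_count(observed: Iterable[float], theoretical: Iterable[float],
                      tol: float = 0.02) -> int:
    """Count theoretical peaks that have an observed peak within tol Da."""
    observed = list(observed)
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def best_candidate(observed: Iterable[float], candidates: List[str],
                   tol: float = 0.02) -> str:
    """Return the candidate peptide whose theoretical ions match best."""
    observed = list(observed)
    return max(candidates,
               key=lambda p: shared_peak_count(observed, theoretical_ions(p), tol))
```

Because only peptides present in the candidate list can be returned, the sketch also shows the stated limitation of database searching: a sequence missing from the database cannot be identified, however good the spectrum.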

Database search software programs and tools are listed in Table 1.3 together with those employed for de novo sequencing. An example of software for de novo peptide sequencing is Novor (www.rapidnovor.org/novor), which is capable of performing real-time de novo sequence analysis with high accuracy (Ma 2015).

Table 1.3 List of mass spectrometry search-related software/websites

1.4 Protein-Protein Interaction Software and Database

Physiological and molecular cell processes are mainly carried out through interactions between different proteins. These interactions are physical associations between protein structures via weak bonds (Khazanov and Carlson 2013; Chang et al. 2016). In agricultural proteomic research, identifying the proteins that bind or interact with each other under defined circumstances and determining the protein-binding sites are very important for a better understanding of the basis of many biological and physiological activities.

Protein interactions play a significant role in protein characterization and in the discovery of protein functions and the pathways in which they are involved (Rao et al. 2014). This is especially true of mutualistic (symbiotic), commensal, and parasitic relationships, which are mediated by specific protein-protein interactions (PPIs) between organisms (Leung and Poulin 2008). The precision of experimental methods for revealing protein-protein interactions is, however, rather doubtful, and high-throughput platforms have produced inaccurate and false-positive interaction information. Given the experimental restrictions and limitations on finding all interactions in a specific proteome, computational prediction of protein interactions is required to move toward complete interaction maps at the proteome level (Keskin et al. 2016). The affordability of high-throughput instruments and the development of computational prediction methods have produced vast numbers of protein-protein interactions. Computational methods for predicting protein-protein interactions can use a variety of biological data, including gene and protein sequences, evolution, and expression. Algorithms and statistics are commonly used to integrate these data and deduce PPI predictions (Clark et al. 2011). The prospect of providing comprehensive and reliable sets of PPIs prompted the development of many databases that aim to gather and unify the available data, each with a different focus and different strengths. PPI databases and examples of their use in agriculture are listed in Table 1.4. Protein-protein interactions have been studied in many areas of agricultural research, such as rice, for which a specific network exists (http://bis.zju.edu.cn/prin/) (Gu et al. 2011; Zhu et al. 2011), the Rhizoctonia solani-rice interaction (Lei et al. 2014), maize (http://comp-sysbio.org/ppim/) (Zhu et al. 2015), chicken, and cattle (Fen et al. 2016).

Table 1.4 List of protein-protein interaction (PPI) software/website

One of the most common databases in agricultural research is the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) (Szklarczyk et al. 2015, 2017b), which incorporates both known and predicted networks between proteins. Currently, the STRING database (version 10.5) covers more than 2000 species, and its 11th version is expected to cover more than 4000. STRING can display 3D structures alongside the interaction network of a given proteome. The database is widely used to predict protein interactions in agricultural proteomic research, such as crops under biotic stress (Liu et al. 2015; Vu et al. 2016; Al-Obaidi et al. 2016a; Wu et al. 2015), oil-crop metabolism (Raboanatahiry et al. 2017), phytopathogenic fungi (Chu et al. 2016; Li et al. 2017), mushroom cultivation (Rahmad et al. 2014), poultry (broiler chicken) (Zheng et al. 2016; Zheng et al. 2014), and buffalo (Ashok and Aparna 2017). The interactive STRING network can be recalculated according to user settings and cut-off values, such as the interaction score and the maximum number of interactions shown, and expanded according to the user's selection. It is currently not clear whether protein-protein interaction networks and databases represent the true biological interactomes. For that reason, agricultural proteomic researchers should make their own assessment of biases and take them into account when inferring any knowledge from protein interaction networks. Besides the freely available databases that predict protein-protein interactions, commercially available software platforms such as Ingenuity Pathway Analysis (IPA) (https://www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis/) and Metacore (https://clarivate.com/products/metacore/) are also considered comprehensive applications that enable the analysis of many “omics” data types (Bessarabova et al. 2012; Yin et al. 2015), including agricultural proteomics; however, these applications are mainly applied in medical rather than agricultural proteomics (Chen et al. 2013).
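The effect of recalculating a network with a score cut-off and a maximum number of displayed interactions can be sketched on a small, entirely hypothetical edge list. The protein names, the scores, and the 0 to 1 score scale are assumptions made for illustration; the code does not call any STRING interface.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str, float]  # (protein A, protein B, interaction score)

# Hypothetical interactions with scores on an assumed 0-1 scale.
EXAMPLE_EDGES: List[Edge] = [
    ("P1", "P2", 0.95),
    ("P2", "P3", 0.72),
    ("P1", "P3", 0.41),
    ("P3", "P4", 0.15),
]


def filter_network(edges: List[Edge], min_score: float = 0.7,
                   max_edges: Optional[int] = None) -> List[Edge]:
    """Keep edges at or above min_score, strongest first, up to max_edges."""
    kept = sorted((e for e in edges if e[2] >= min_score),
                  key=lambda e: e[2], reverse=True)
    return kept if max_edges is None else kept[:max_edges]


def adjacency(edges: List[Edge]) -> Dict[str, Set[str]]:
    """Build an undirected neighbour map from an edge list."""
    graph: Dict[str, Set[str]] = {}
    for a, b, _ in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


if __name__ == "__main__":
    for cutoff in (0.15, 0.4, 0.7):
        print(cutoff, len(filter_network(EXAMPLE_EDGES, min_score=cutoff)))
```

Running the filter at several cut-offs shows how strongly the apparent network depends on user settings, which is one reason to assess bias before drawing biological conclusions from it.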

Proteomic analysis, in general, depends on data visualization, which plays an important role in understanding new results of proteomic research. In agricultural proteomic research, especially for high-throughput experiments, heat maps are particularly suitable for this purpose, as they allow quantitative patterns to be viewed across many proteins simultaneously. Heat maps are very useful for presenting comparative proteomic results in a simple yet expressive way. The quality of a heat map can be greatly improved by understanding and using the options available in the online tools/software to organize the data in the heat map (Key 2012; Acton 2013). The heat map style of presentation appears to have originated from color-based maps used to show differences in temperature. Websites and software used to create heat maps in proteomic research are listed in Table 1.5.
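One common option for organizing data before drawing a heat map is row-wise standardization, so that the color scale reflects relative change per protein rather than absolute abundance. The sketch below assumes abundance values stored as a plain dictionary and uses the sample standard deviation; the function name and data layout are hypothetical, and plotting itself is left to the tool of choice.

```python
from statistics import mean, stdev
from typing import Dict, List


def z_score_rows(abundance: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Standardize each protein's abundance values to mean 0 and unit spread.

    Rows with no variation (or a single value) are mapped to zeros so that
    they appear neutral in the heat map instead of causing a division by zero.
    """
    scaled: Dict[str, List[float]] = {}
    for protein, values in abundance.items():
        mu = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        scaled[protein] = [0.0 if sd == 0 else (v - mu) / sd for v in values]
    return scaled


if __name__ == "__main__":
    # Hypothetical spot volumes for two proteins across three samples.
    data = {"protein_A": [1.0, 2.0, 3.0], "protein_B": [500.0, 500.0, 500.0]}
    print(z_score_rows(data))
```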

Table 1.5 Heat map generating tools/software/website

1.5 Proteoinformatics of Edible Mushroom

Information on the life cycles and metabolism of edible mushrooms is highly important for designing workable, productive, and effective cultivation processes, especially for fungal species that are hard to propagate and need a special medium, temperature, etc. (Zhang et al. 2014a). Research on the physiological changes, growth stages, development, environmental interactions, and dietary contribution of edible mushrooms has used approaches ranging from cell biology, physiology, and chemistry to current multi-omic techniques such as genomics (Chen et al. 2016), transcriptomics (Fu et al. 2017), proteomics (Rahmad et al. 2014), and metabolomics (Pandohee et al. 2015). Recently, the availability of bioinformation for many edible mushroom species, in particular their genome sequences, has made many proteomic studies possible (Shim et al. 2016; Yang et al. 2017); this availability reflects the high demand for edible mushrooms and their importance in the food industry, medicine, and healthcare (Yap et al. 2014).

The availability of genome sequences for these edible mushrooms allows researchers to run genome-based proteomics (Yap et al. 2015), which has provided valuable information for developing molecular markers that can be used to improve the quality and usage of edible fungi. Recently, the importance of applying proteomic platforms in edible mushroom research has been highlighted, especially with regard to possible nutraceutical and medicinal applications (Al-Obaidi 2016b). Mushroom genome sequences make it possible for researchers to study mushroom growth (Tang et al. 2016; Wang et al. 2013a), developmental stages (Rahmad et al. 2014; Yin et al. 2012), and the medicinal properties of higher fungi (Yap et al. 2014).

1.6 Proteoinformatics of Animal Breeding Programs

The final products of intensive terrestrial animal farming systems (cattle, poultry, and sheep) have conventionally been mainly meat and milk products, alongside fish and other products from the aquaculture sector; both have gained importance in terms of capacity and nutritional properties. Fundamental proteomics can be considered a promising tool for the discovery of diagnostic protein biomarkers and of markers of animal product quality.

Recently, interest in studying livestock animals with proteomic and metabolomic platforms has increased rapidly (Suravajhala et al. 2016). Biomarkers have been developed in chicken for different research goals, while in dairy cattle, numerous potential biomarkers have been detected for meat and milk production (Goldansaz et al. 2017; Ortea et al. 2016). In domestic livestock and animal proteomics, identification by database search is generally not an issue, since a comprehensive database of protein sequences is most probably available; databases such as MetazSecKB (http://bioinformatics.ysu.edu/secretomes/animal/index.php) can serve as a good reference. On the other hand, when the animal genome has not been sequenced or is incomplete, other approaches such as de novo peptide sequencing are usually used (Blakeley et al. 2011). Commonly, in the absence of sufficient proteoinformation, spectra are searched against protein sequences from closely related organisms. Small differences between the peptide sequences in the sample and the genome/proteome database entries may lead to large differences in protein identities. This issue complicates proteome analysis for non-sequenced species and between different subspecies, where differences in the amino acid sequences of proteins are likely (Ignatchenko et al. 2017). These approaches pose significant bioinformatic challenges because several factors affect, or add inconsistency to, the determination of protein identities. Given sufficient proteoinformatic data, protein identification studies and research on metabolomic changes are considered the basis for building models of whole systems. Such systems will permit investigators to understand the function of protein complexes in response to disease and environmental changes (Romero-Rodríguez et al. 2014).
In animal breeding research, proteomics may help in the search for animal biomarkers and offer more accurate health measures for livestock, which are essential for improving breeding programs, disease resistance, stress tolerance, and adaptation to environmental changes (Marco-Ramell et al. 2016).
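The effect of small sequence differences when searching against a closely related organism can be illustrated with an in-silico tryptic digestion. The sketch applies the usual trypsin rule (cleavage after K or R, except before P) to a hypothetical sample protein and a hypothetical reference ortholog that differ by a single residue, and reports which sample peptides have no exact counterpart in the reference. Both sequences and all function names are invented for illustration.

```python
from typing import List, Set


def tryptic_peptides(sequence: str) -> List[str]:
    """Digest a protein in silico: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides


def unmatched_peptides(sample: str, reference: str) -> Set[str]:
    """Return sample peptides with no identical peptide in the reference."""
    return set(tryptic_peptides(sample)) - set(tryptic_peptides(reference))


if __name__ == "__main__":
    reference = "MKTAYIAKQRLGSPEK"  # hypothetical database entry
    sample = "MKTVYIAKQRLGSPEK"     # hypothetical ortholog, one substitution
    print(unmatched_peptides(sample, reference))
```

A single substitution removes an entire peptide from the set of exact matches, which helps explain why identification coverage drops for non-sequenced species and subspecies.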

1.7 Conclusion

This chapter has concentrated mainly on the application of proteomics software programs and databases in the agricultural sciences, where organisms with no or incomplete genomic sequence data make protein identification more challenging than in highly studied organisms. The power of multi-omic methods for high-throughput identification and characterization of candidate genes tends to be lost in non-model organisms because of the lack of sufficient biological information. The availability and accessibility of more sequences for plants, fungi, and other agriculture-related organisms will likely ease some of these difficulties by making genomic data available for many non-model organisms. However, proteomic studies cumulatively produce huge amounts of data, which are usually interpreted by collecting protein annotations from databases. Answering biological questions using these data is still a great challenge. In conclusion, key objectives for agricultural proteoinformatics include encouraging sequence submission and making the sequences available to the public research community. Finally, proteoinformatic databases, software programs, and methods need to be designed and utilized in a better way. Many tools and databases are adapted from human, and specifically medical, analysis systems, and these may not be ideal for the analysis of plant, fungal, and other agricultural proteomic data.