1 Introduction

Rice (Oryza sativa L.) is an important crop of the Gramineae (Poaceae) family, and a large share of the human population depends on it as a staple food [1]. However, recent climate change has drastically reduced yield and has also affected the rice genome at the molecular level. Agronomic traits are controlled by many genes, and numerous genes and their sequences have been identified to date and used in breeding programs to improve the yield and quality of rice [2]. Additionally, because rice has a small genome, it is often used as a model organism for monocot plant species. Thus, there is a continuing effort to uncover the genetic structure and composition of the rice genome. Although recent advances in high-throughput sequencing technologies have helped to address these questions, they have also led to the accumulation of huge amounts of biological data. This in turn demands the development of bioinformatic databases and tools that can store and analyze data efficiently. Accordingly, many rice-specific databases have been developed that contain heterogeneous biological data ranging from genomics to proteomics. They are regularly used for tasks such as gene mapping, gene identification, functional characterization, gene expression analysis, and protein structure and function prediction. This chapter briefly describes various rice databases and the information they contain.

2 Genomic Database

Gramene is a comparative genomic database of plants that incorporates knowledge regarding genetic maps, sequences, gene markers, proteins, pathways, and phenotypes [3]. One may search or explore the database to identify genes and phenotypes that share similar characteristics. Different plant species can also be compared on the basis of shared genes, genomes, mechanisms, and phenotypes. From the Gramene database, one may also find the positions of genes on the chromosomes and their functions [4]. The genome browser module of Gramene displays the annotation of the rice genome, including information on SNPs, indels, and markers for the respective genes. The rice genome also acts as the reference for genome-wide comparative studies. To relate rice to other cereals, researchers use a number of sequence-based features, including expressed sequence tags (ESTs), flanking sequence tags (FSTs), proteins, cDNAs, BACs and BAC ends, SNPs, microarray probes, and repeat sequences [5].

The database takes advantage of the genetic colinearity (synteny) between rice and other major crop genomes to annotate the genomes of organisms that have not themselves been sequenced [5]. For annotation, Gramene employs several modules, such as the BLAST tool, the marker module, and the QTL module. The BLAST tool aligns a query sequence against the reference genomes available for a particular crop, which often span many species. For Oryza, the database stores information on 11 Oryza taxa, including wild species: Oryza barthii, Oryza glaberrima, Oryza glumipatula, Oryza longistaminata, Oryza punctata, Oryza rufipogon, Oryza brachyantha, Oryza sativa indica, Oryza meridionalis, Oryza sativa japonica, and Oryza sativa f. spontanea. The BLAST results report, for each aligned gene, the E-value, the percent identity with the reference, and the orientation of the alignment on the positive or negative strand. BLAST is discussed in detail in Chap. 7 of the book. The QTL module provides information about QTLs of rice and several other cereals (described below in the QTL section). The marker module provides information on markers available for the genes and QTLs of the crops. Detailed information on microsatellite (SSR) markers of the rice genome, with their mapped chromosome positions, is also available, which makes it easy to locate a gene and its nearest neighboring genes. These data provide a gateway for finding the marker position of a particular gene with respect to various traits, such as abiotic and biotic stress traits in rice.
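
The E-value, percent identity, and strand orientation reported for each BLAST alignment can be read programmatically. The following is a minimal sketch, assuming the hits are available in the 12-column tabular format produced by NCBI BLAST+ (-outfmt 6) for a nucleotide search; whether Gramene exports exactly this format is an assumption, and the example rows, thresholds, and function names are hypothetical.

```python
from dataclasses import dataclass

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()


@dataclass
class Hit:
    query: str
    subject: str
    identity: float  # percent identity with the reference
    evalue: float
    strand: str      # "+" or "-"


def parse_hit(line: str) -> Hit:
    """Turn one tab-separated hit line into a Hit record."""
    cols = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    sstart, send = int(cols["sstart"]), int(cols["send"])
    # In nucleotide searches a minus-strand hit is reported with sstart > send.
    strand = "+" if sstart <= send else "-"
    return Hit(cols["qseqid"], cols["sseqid"],
               float(cols["pident"]), float(cols["evalue"]), strand)


def filter_hits(lines, max_evalue=1e-5, min_identity=90.0):
    """Keep hits that pass illustrative E-value and identity cutoffs."""
    hits = (parse_hit(l) for l in lines if l.strip() and not l.startswith("#"))
    return [h for h in hits
            if h.evalue <= max_evalue and h.identity >= min_identity]


# Hypothetical hits, not real Gramene output.
example = [
    "geneA\tchr1\t98.50\t200\t3\t0\t1\t200\t1500\t1699\t1e-100\t360",
    "geneA\tchr5\t85.00\t120\t18\t0\t10\t129\t9000\t8881\t2e-20\t95.3",
    "geneB\tchr3\t99.00\t300\t3\t0\t1\t300\t7300\t7001\t0.0\t540",
]

for hit in filter_hits(example):
    print(hit.query, hit.subject, hit.identity, hit.evalue, hit.strand)
```

The cutoffs are placeholders; suitable values depend on the query length and on how divergent the compared species are.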

Another widely used rice genomic database is the “Rice Annotation Project Database (RAP-DB)” [6, 7]. Information about the rice genome generated by the “International Rice Genome Sequencing Project (IRGSP)” is stored for public use in RAP-DB [6, 7]. In this database, the Japanese rice cultivar Nipponbare (build IRGSP 1.0) serves as the reference genome for all rice genotypes. The genome sequencing was carried out on a high-throughput Illumina sequencing platform. RicyerDB is another rice genomic database; it contains information on the yield-related genes of rice. Both genomic and proteomic data are available in this database, which allows researchers to retrieve gene information and the pathways involved in gene expression in rice [8]. The tools of RicyerDB provide a basis for a shared platform for browsing and visualizing yield-related genes. Another utility allows the user to search for a single gene and offers insight into its roles and positions in biological processes. The database also supports the study of protein–protein interactions and the construction of protein–protein interaction networks. It integrates information from various sources, such as RAP-DB, the Rice Genome Annotation Project (RGAP), NCBI, UniProt, and STRING.

RiceGAAS, the “Rice Genome Automated Annotation System,” annotates rice genomic data automatically and makes the results available for public use [21]. The data are collected from various sources, such as IRGSP [70] and sequences submitted to the public database DDBJ (https://www.ddbj.nig.ac.jp/services-e.html), and the system provides information on gene entries, gene homology identification, prediction of long terminal repeats, and gene models. RiceRelativesGD is another gene database aimed at providing a genetic resource useful for rice breeding. Mao et al. [71] developed this database in 2019 to provide genetic information on species closely related to rice. It contains publicly accessible genomic information for 13 rice relatives: O. sativa (japonica group), O. sativa (indica group), O. rufipogon, O. barthii, O. glumaepatula, O. meridionalis, O. nivara, O. punctata, O. brachyantha, Leersia perrieri, O. glaberrima, Zizania latifolia, and Echinochloa crus-galli. The resource provides knowledge on genes with specific functions, such as stress response and photosynthesis, that are used in breeding programs.

3 QTL/Gene Database

QTL databases are used to identify gene functions that can be exploited in breeding programs [28]. For instance, the Q-TARO database provides detailed information on rice QTLs. One important feature is a table listing all of the QTLs and their parameters, such as trait or trait type, mapping population, mapping accuracy (LOD value), and map location. Another important feature is the genome browser, which displays the genomic positions of QTLs. Q-TARO also specifically displays how QTLs and QTL regions co-locate along the rice genome [27].

For plants, Gramene is another resource of quantitative trait information that integrates data across different data domains. In the Gramene QTL database, a QTL is treated as a genomic region associated with a particular phenotype. The QTL database includes the world’s largest online repository of QTL data for rice. QTLs initially published on individual genetic maps have been systematically aligned to the rice sequence using flanking markers as anchors, so that they can be searched like ordinary genomic features. This enables the analysis of QTLs in colinear regions of other cereals and allows researchers to identify sequences and QTLs associated with related traits or phenotypes across a broad variety of plant species. Researchers can determine whether a QTL colocalizes with other QTLs and can integrate data from different studies to improve the accuracy of a QTL location. The database thus gives plant biologists and geneticists a way to investigate the relationship between genomic variation and diverse forms of phenotypic variation [28].
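
Once QTLs are anchored to the genome sequence by their flanking markers, each QTL is an interval on a chromosome, and colocalization reduces to an interval-overlap test. The sketch below illustrates this idea only; the record fields, QTL names, and coordinates are hypothetical, and taking the plain intersection of two intervals is a simplification of the statistical meta-analysis that real studies use to refine a QTL location.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class QTL:
    name: str
    chromosome: int
    start: int   # position of the left flanking marker (bp)
    end: int     # position of the right flanking marker (bp)
    lod: float   # LOD score reported by the mapping study


def colocalize(a: QTL, b: QTL) -> bool:
    """True if both QTL intervals lie on the same chromosome and overlap."""
    return (a.chromosome == b.chromosome
            and a.start <= b.end and b.start <= a.end)


def shared_interval(a: QTL, b: QTL) -> Optional[Tuple[int, int]]:
    """Return the region supported by both studies, or None if disjoint."""
    if not colocalize(a, b):
        return None
    return max(a.start, b.start), min(a.end, b.end)


# Hypothetical QTLs from two independent studies of the same trait.
q1 = QTL("study1_qtl", 3, 1_000_000, 2_500_000, 4.2)
q2 = QTL("study2_qtl", 3, 2_000_000, 3_200_000, 6.8)
q3 = QTL("study3_qtl", 7, 2_000_000, 3_200_000, 3.5)

print(colocalize(q1, q2), shared_interval(q1, q2))
print(colocalize(q1, q3), shared_interval(q1, q3))
```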

To improve QTL-based candidate gene identification and gene expression analysis, the PlantQTL-GE database has been developed [29]. This database contains chromosome information, gene expression details from microarray data and ESTs, and genetic markers of Arabidopsis thaliana and Oryza sativa. Another resource, The Institute for Genomic Research (TIGR) database, contains DNA, RNA, and protein sequences of plants, humans, and microbes. It includes repetitive DNA sequences of 12 plant genera, namely Arabidopsis, Hordeum, Brassica, Glycine, Lotus, Oryza, Triticum, Lycopersicon, Medicago, Sorghum, Solanum, and Zea. The repeat sequences within each database are further classified into groups and subclasses based on sequence and structure similarities [72]. The files can be downloaded from different sources and used for sequence similarity searches [73].

4 Single Nucleotide Polymorphism (SNP) Database

SNPs, which are the most abundant type of genetic variation, are widely used in genetic studies [74]. SNPs play a key role in gene mapping and in the study of diversity and evolutionary variation among populations. They also play an important role in designing markers to identify genetic variation in present-day genomes and to trace variation inherited from the wild type. While other forms of variation, including indels, microsatellites, copy number variation, and epigenetic markers, remain important to consider and can cause disease, SNPs are generally the simplest to determine in genetic studies and are the most useful and commonly used markers. One of the most important SNP databases is dbSNP (http://www.ncbi.nlm.nih.gov/SNP), in which SNP identifiers (SNPids) or rsIDs are used to identify SNPs. This database organizes sequences by their variation, which makes it possible to distinguish between two sequences and to search for polymorphisms arising from nucleotide substitutions, insertions, deletions, and nucleotide repeats [75].

Through the 3000 rice genome sequencing project, it has become possible to identify SNPs by aligning a gene of interest against the rice reference genome [76, 77]. The results indicate whether SNPs are synonymous or nonsynonymous, that is, whether or not they change the encoded amino acid. The 3000 rice genome project is described in detail in Chap. 5. The Rice Variation Map database was developed for the study of genomic variation [30]. Yan et al. [78] developed an SNP database for rice, namely IC4R, that provides SNP information from 18 billion reads. For commercial rice varieties, they provide SNP barcodes so that varieties can be readily assessed using seven machine learning-based methods (DT, KNN, NB, ANN, RF, LR-M, and LR-O) implemented in the Python scikit-learn library (https://scikit-learn.org/stable/). To identify genetic variation under various stress conditions, the Rice Stress-Resistant SNP (RSRS) database has been developed [35]. More recently, Alexandrov and colleagues developed the SNP-Seek database using 60 billion SNP reads; it provides SNP information for any genomic region after discarding indels, which helps to characterize genetic variation [33, 79].
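
Whether a coding SNP is synonymous or nonsynonymous follows directly from the standard genetic code: the variant codon is translated and compared with the reference codon. The sketch below illustrates that rule only; it is not the annotation pipeline of any of the databases above, the function name is hypothetical, and it assumes the SNP lies in a coding region with a known reading frame on the coding strand.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(ref_codon: str, pos_in_codon: int, alt_base: str):
    """Classify a coding SNP by comparing reference and variant amino acids.

    pos_in_codon is 0, 1, or 2; '*' in the result denotes a stop codon.
    """
    ref_codon = ref_codon.upper()
    alt_codon = (ref_codon[:pos_in_codon] + alt_base.upper()
                 + ref_codon[pos_in_codon + 1:])
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return label, ref_aa, alt_aa


print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, both encode glutamate
print(classify_snp("GAA", 0, "A"))  # GAA -> AAA, glutamate to lysine
```

A change to a stop codon is reported here as nonsynonymous; annotation tools usually separate such nonsense variants into their own class.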

5 Transcriptome Database

With the introduction of high-throughput technologies such as next-generation sequencing platforms, it has become possible to measure gene expression at the molecular level. Data on gene expression have proven invaluable to genome annotation programs [80]. Functional genomic studies help researchers to determine the functions of rice genes that control various traits. To support such studies, several databases have been developed with bioinformatic tools to hold information on genes associated with specific traits of interest. For instance, the TENOR (Transcriptome ENcyclopedia of Rice) database comprises large-scale mRNA sequencing (mRNA-Seq) data from rice. This database includes details on rice transcriptomes, such as transcript structures and expression profiles, as well as coexpression data and data on cis-regulatory elements in the 1 kb upstream region of each gene. Since the ability of plants to adapt to different growing conditions is a key issue in plant science, understanding the regulatory networks of genes associated with environmental changes is of great interest [81]. The team developed the database using mRNA-Seq data obtained under 10 abiotic stress conditions (high, low, and extremely low cadmium; high and low phosphate; high salinity; drought; osmotic stress; cold; and flood) and under two plant hormone treatments (jasmonic acid and ABA). Earlier, Oono et al. used the TENOR database to identify transporters involved in cadmium tolerance and to study gene expression under cadmium exposure [40].

TENOR offers three kinds of search systems. First, users can provide one or more transcript IDs to retrieve functional annotations and expression patterns under various conditions. Users can also search genes by functional annotation keywords; this returns both partially and completely matched results. Second, the “GBrowse” genome browser helps users to look up transcripts by transcript ID or genomic coordinate. In this search, the user can obtain information on transcript structures, both annotated and previously unannotated, together with their characteristics. Third, the user can search for stress- or hormone-responsive genes by their expression patterns, specifying the direction of change (suppression or induction), experimental conditions, sampled tissues, and time points, in addition to fold change (FC) thresholds and the statistical significance of changes in expression level.
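
A search by direction of change and fold change threshold can be illustrated with a few lines of code. This is a minimal sketch of the general idea, not TENOR's implementation: the pseudocount, the default threshold of 2, and the function names are assumptions, and the statistical significance test mentioned above is not covered.

```python
import math


def log2_fold_change(control: float, treated: float,
                     pseudocount: float = 1.0) -> float:
    """log2(treated / control); the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def classify_response(control: float, treated: float,
                      fc_threshold: float = 2.0,
                      pseudocount: float = 1.0) -> str:
    """Label a gene as induced, suppressed, or unchanged under a treatment."""
    lfc = log2_fold_change(control, treated, pseudocount)
    cutoff = math.log2(fc_threshold)
    if lfc >= cutoff:
        return "induced"
    if lfc <= -cutoff:
        return "suppressed"
    return "unchanged"


# Hypothetical expression values (e.g., normalized read counts).
print(classify_response(3.0, 15.0))   # fourfold increase after pseudocount
print(classify_response(15.0, 3.0))   # fourfold decrease after pseudocount
print(classify_response(10.0, 12.0))  # below the twofold threshold
```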

Rice Expression Database (RED) is an interactive rice gene expression database built entirely from RNA-Seq data. RED curates 284 high-quality RNA-Seq datasets, covers a wide range of gene expression profiles, and encompasses a wide range of plant developmental stages. RED also provides collections of housekeeping and tissue-specific genes and dynamically builds coexpression networks for gene(s) of interest. RiceArrayNet is another database; it reports coexpression between rice genes in terms of correlation coefficients, which describe the coexpression pattern of genes in the rice genome [37]. Lee and colleagues developed this database, which presents the correlation data in three ways: as a cluster or network, as a scatter plot, and as a histogram. Another Web-based database for plant gene analysis is the “PLAnt co-EXpression database (PLANEX).” It includes freely accessible GeneChip data collected from the “Gene Expression Omnibus (GEO)” of NCBI. PLANEX is a genome-wide coexpression database that helps to assign functions to genes across a wide range of experimental designs [42]. For a gene of interest, PLANEX reports the distribution of “Pearson’s correlation coefficients (PCCs; r-values)” within a specified microarray platform for a single organism. The PLANEX database offers a correlation database, a cluster network, and enrichment test results for eight plant species: Arabidopsis thaliana, Triticum aestivum, Glycine max, Vitis vinifera, Hordeum vulgare, Solanum lycopersicum, Oryza sativa, and Zea mays. The cluster network of coexpressed genes in PLANEX is computed using the k-means method.
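
The Pearson correlation coefficient that underlies coexpression resources such as RiceArrayNet and PLANEX can be computed directly from two expression profiles. The sketch below shows the calculation under simple assumptions: the gene names and expression values are hypothetical, the cutoff of 0.9 is arbitrary, and the databases themselves may apply additional normalization that is not shown here.

```python
import math


def pearson(x, y):
    """Pearson's r between two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, min_r=0.9):
    """Return gene pairs whose profiles correlate at or above min_r."""
    genes = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(profiles[g1], profiles[g2])
            if r >= min_r:
                pairs.append((g1, g2, r))
    return pairs


# Hypothetical expression values across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

for g1, g2, r in coexpressed_pairs(profiles):
    print(g1, g2, round(r, 3))
```

Pairs that pass the cutoff form the edges of a coexpression network, which can then be clustered, for example with k-means as in PLANEX.
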
Genevestigator is another advanced Web-based platform that applies modern data mining concepts and novel algorithms to the meta-analysis of gene expression. It is built on the systematic, large-scale integration of normalized and quality-controlled expression data with ontology-based experimental context variables such as anatomy, development, perturbation, or genetic background. This large-scale combination of data and metadata gives new insight into the spatiotemporal response patterns of transcriptomes and helps users to answer questions that cannot be answered by evaluating a single experiment [50]. Other important gene expression databases are listed in Table 3.1.

Table 3.1 Description of various databases used in rice research

6 Protein Database

Protein databases provide information on the proteins encoded by genes. For functional genomics, proteome studies linked to genome sequence data are useful, because they help researchers to identify which genes are expressed as proteins, including products of alternative splicing and post-transcriptional modification. Rice proteomes can be studied in leaves, embryos, endosperms, roots, branches, shoots, and calluses, which reveals in detail how gene products vary under different environmental conditions. Many databases have been developed for proteome studies in rice. For instance, OryzaPG-DB, a rice proteome database based on shotgun proteogenomics, integrates genomic features with data from experimental shotgun proteomics. This version of the database was built from the results of 27 nanoLC-MS/MS runs on a hybrid ion trap–orbitrap mass spectrometer, which provides high accuracy for the analysis of tryptic digests from undifferentiated cultured rice cells. Peptides were identified by searching the product ion spectra against the Michigan State University (MSU) protein, cDNA, transcript, and gene databases and were then mapped to the rice genome. These peptides covered approximately 3200 genes, and 40 of them contained novel genomic features. Users can search, download, or browse the database by chromosome, gene, protein, cDNA, or transcript and can download the revised annotations in standard GFF3 format, with visualization in PNG format.
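
Annotations distributed in GFF3 format are plain tab-separated text with nine columns (seqid, source, type, start, end, score, strand, phase, attributes) and can be read without special software. The sketch below parses such lines under that standard layout; the example records are hypothetical and are not taken from OryzaPG-DB.

```python
from urllib.parse import unquote


def parse_gff3_line(line: str):
    """Parse one GFF3 feature line; return None for comments and blanks."""
    if line.startswith("#") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 escapes reserved characters with URL-style percent codes.
            attributes[key.strip()] = unquote(value.strip())
    return {"seqid": seqid, "source": source, "type": ftype,
            "start": int(start), "end": int(end),
            "strand": strand, "attributes": attributes}


def features_of_type(lines, ftype):
    """Collect all parsed features of one type, e.g. 'gene' or 'mRNA'."""
    parsed = (parse_gff3_line(l) for l in lines)
    return [f for f in parsed if f is not None and f["type"] == ftype]


# Hypothetical records for illustration only.
example = [
    "##gff-version 3",
    "Chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene0001;Name=hypothetical%20gene",
    "Chr1\texample\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna0001;Parent=gene0001",
]

for gene in features_of_type(example, "gene"):
    print(gene["seqid"], gene["start"], gene["end"], gene["attributes"]["ID"])
```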

The “Mitochondrial Protein Import Components (MPIC)” database contains searchable information on the mitochondrial protein import apparatus of plants and non-plant organisms. An in silico analysis compared the mitochondrial protein import apparatus of 24 organisms representing different lineages, from Saccharomyces cerevisiae and algae to Homo sapiens and plants, including Oryza sativa, Arabidopsis thaliana, and other newly sequenced plant species. In the MPIC DB, each of these species has been thoroughly searched and manually curated. The database provides a user-friendly graphical map that enables users to find the import component of interest. The MPIC DB offers a comprehensive resource to support detailed investigation of the mitochondrial protein import machinery and to identify patterns of conservation and divergence that might otherwise have been overlooked [64].

The Manually Curated Database of Rice Proteins (MCDRP) is a rice protein database focused mainly on published experimental data [68]. Another database, the “Database of Interacting Proteins in Oryza sativa (DIPOS),” offers detailed information on interacting proteins in rice, with interactions predicted by two computational approaches: interologs and domain-based methods. DIPOS comprises 14,614,067 pairwise interactions among 27,746 proteins, covering around 41 percent of the entire proteome of Oryza sativa [66].
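
The interolog approach transfers an interaction observed between two proteins in a reference organism to their orthologs in the target species. The sketch below shows that transfer step in its simplest form; it is a generic illustration rather than the DIPOS pipeline, and all protein identifiers and the ortholog mapping are hypothetical.

```python
def predict_interologs(reference_interactions, ortholog_map):
    """Predict target-species interactions from a reference interactome.

    reference_interactions: iterable of (protein_a, protein_b) pairs
        observed in the reference organism.
    ortholog_map: dict mapping each reference protein to a list of its
        orthologs in the target species (here, rice).
    """
    predicted = set()
    for a, b in reference_interactions:
        for rice_a in ortholog_map.get(a, ()):
            for rice_b in ortholog_map.get(b, ()):
                # Store each pair in sorted order so A-B and B-A coincide.
                predicted.add(tuple(sorted((rice_a, rice_b))))
    return predicted


# Hypothetical reference interactions and ortholog assignments.
reference_interactions = [("AtP1", "AtP2"), ("AtP2", "AtP3")]
ortholog_map = {
    "AtP1": ["OsP1"],
    "AtP2": ["OsP2a", "OsP2b"],
    # AtP3 has no known rice ortholog, so its interaction is not transferred.
}

for pair in sorted(predict_interologs(reference_interactions, ortholog_map)):
    print(pair)
```

Because one reference protein may have several orthologs, a single observed interaction can yield multiple predicted pairs, which is one reason predicted interactomes grow to millions of pairs.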

7 Gene Ontology and Pathway Database

To broaden our understanding of biological processes in plants and to explain how biological functions arise, it is important to characterize the diverse pathways involved. The Planteome project (http://www.planteome.org) provides a suite of reference and species-specific ontologies for plants, together with annotations to genes and phenotypes. Ontologies serve as common standards for the semantic integration of a large and growing body of data from plant genomics, phenomics, and genetics. The reference ontologies are the Plant Ontology, Plant Trait Ontology, and Plant Experimental Conditions Ontology, developed by the Planteome project, together with the Gene Ontology, Chemical Entities of Biological Interest, Phenotype and Attribute Ontology, and others [69]. The platform also offers access to species-specific crop ontologies established around the world by different plant breeding and research communities. For 95 plant taxa, it offers integrated data on plant traits, phenotypes, and gene function and expression, annotated with reference ontology terms. To facilitate community engagement, the Planteome project has developed a plant gene annotation platform, Planteome Noctua. All Planteome ontologies are freely accessible and are maintained on the Planteome GitHub platform (https://github.com/Planteome), where they are shared and where revisions and new term requests are tracked. The annotated data are readily available from the ontology browser (http://browser.planteome.org/amigo) and from the project's data archive.

RiceCyc is a catalog of the rice metabolic pathway network [82]. It gives a snapshot of the substrates, enzymes, metabolites, reactions, and pathways of primary and intermediary metabolism in rice. Version 3.3 of RiceCyc contains 316 pathways and 6643 protein-coding genes mapped to 2103 enzyme-catalyzed reactions and 87 protein-mediated transport reactions. The original functional annotations of rice genes with InterPro, Enzyme Commission (EC) numbers, MetaCyc, and Gene Ontology were enriched with annotations provided by the KEGG and Gramene databases. Pathway inferences and network diagrams were first predicted with the PathoLogic module of the Pathway Tools software, based on MetaCyc reference networks and plant pathways from the Plant Metabolic Network. These predictions were then enriched by manually adding metabolic pathways and explicitly reported gene functions for rice. The RiceCyc database can be browsed hierarchically, from pathway diagrams down to the relevant genes, metabolites, and chemical structures. Users may also upload transcriptomic, proteomic, and metabolomic data to visualize expression patterns in a virtual cell using the integrated Omics Viewer tool. Together with the other species-specific pathway databases hosted by the Gramene project, RiceCyc enables comparative pathway analysis [82]. Another database, IntAct, provides a freely available, open-source database system and analysis tools for molecular interaction data. All interactions are derived from literature curation or direct user submissions and are publicly accessible (https://www.ebi.ac.uk/intact/home.xhtml).

8 Conclusion and Future Perspectives

In conclusion, we have attempted to catalog the numerous Web-based data services and tools available for rice research. Some of them are well known and commonly used, although a few are recent, small-scale repositories. With the growing number of databases, it is clear that an overwhelming amount of data is accessible on the Web for almost every area of rice science. However, despite its volume and diversity, this data has not been exploited effectively, since many biology researchers and prospective users are unfamiliar with the tools available to find and interpret it [83]. Different databases often use different data exchange formats and protocols, which makes it difficult to integrate them in one place. Ideally, a single platform would cover all databases in a domain of interest, so that a user could use APIs and ontologies to search all of them with a single query and compare the results; Araport [84] is one such initiative. To improve the credibility of their data, some databases are now adding links to other databases that hold similar types of data, which is a first step toward a single platform. This maximizes the use of data in existing resources and may help to prevent duplication. It also gives small databases greater reach, and together they may offer a broader picture, since small databases typically concentrate on one particular aspect and cover it in comprehensive detail.