1 Introduction

A comprehensive review of global demographic trends and future prospects, published as the World Population Prospects: The 2017 Revision by the United Nations Department of Economic and Social Affairs (UN DESA), predicted that the current world population of 7.6 billion would grow each year by roughly 83 million people (https://population.un.org/wpp/). The global population is accordingly projected to reach 8.6 billion in 2030, 9.8 billion in 2050 and 11.2 billion in 2100. From 2017 to 2050, half of the global population growth is expected to be concentrated in a small number of countries, and such growth in the poorest countries will present a significant challenge to their governments, which strive to implement the 2030 Agenda for Sustainable Development. The nine countries listed in the review that are expected to contribute most to global population growth are Indonesia, Uganda, the United States of America, the United Republic of Tanzania, Ethiopia, Pakistan, the Democratic Republic of the Congo, Nigeria and India. The unavoidable question that arises in the face of this global crisis is how to ensure access to safe, nutritious and sufficient food for all people.

1.1 Rice: The Right Choice

Rice is a very old crop plant, with a long evolutionary and cultivation history. Since time immemorial, it has been regarded as an excellent staple crop, as cooking rice is very simple, in contrast to bread making, which requires fermentation by yeast. Not only is rice a rich source of carbohydrates, but it also contains all the essential amino acids, although it is low in lysine. Rice is mainly harvested and consumed in Asia and Africa, but at a global level, rice is the staple crop for half of the world’s population and constitutes nearly 50% of their daily calorie intake. Like religion or tradition, rice is deeply ingrained in our lives.

The genus Oryza contains 27 species, 2 of which were domesticated independently ~6000 years apart (Wing et al. 2018). The domestication of the Asian rice (Oryza sativa) and African rice (Oryza glaberrima) occurred on two different continents, at different time points in the evolutionary history of rice. Therefore, rice has an extensive germplasm collection of wild and domesticated species, with some extinct wild species which were progenitors of the present-day rice. Thus the present-day picture of rice evolution is quite complex, as it emerges that divergent ancestral populations gave rise to extant rice populations (Civáň et al. 2015). One or more de novo domestication events may have led to the different varietal groups that exist today, e.g. Oryza sativa subspecies, japonica and indica.

Since its domestication and adaptation, rice has been cultivated for thousands of years in the Old World and for hundreds of years in the New World (Wing et al. 2018). Throughout this time, rice has undergone significant improvement, adapting to a range of geographical locations, soil textures, climatic and environmental conditions and also cooking preferences. This has also led to positive selection of agronomically desirable traits, which are still exploited in current-day breeding programmes. Identifying these desirable traits and crossing plants to develop new varieties with a high-yielding phenotype or with resistance to adverse climatic conditions are well-established practices. To achieve higher yield and accelerated growth, hybrid vigour or heterosis has also been exploited, where genetically distinct parents are crossed to produce hybrid offspring with an improved or enhanced desirable biological quality. An early example of rice improvement by hybridisation, dating from the Neolithic age, is the origin of the O. sativa varietal group indica (Wing et al. 2018). Introgression of genomic regions from subspecies japonica to indica led to integration of a number of agronomically desirable genes, for example:

  (a) SH4: non-shattering allele, enhances sustainability of mature seeds on the panicle.

  (b) RC: colour allele, imparts the white grain colour which is desirable.

  (c) PROG1: important for erect growth of the crop, imparts lodging resistance.

Interestingly, the SH4 orthologue in African rice is associated with a mutation, which drives the selection for the non-shattering phenotype in this domesticated crop (Wang et al. 2014). Other genes regulating plant height, grain quality, disease and stress resistance, fertility, nutrient uptake and cooking or eating quality were eventually incorporated as a result of geographical spread and adaptation. Genome-wide association studies have revealed genomic regions that have undergone selection related to geographical adaptation and identified genes of agronomic potential (Meyer et al. 2016) (summarised in Table 1).

Table 1 List of genes that impart agronomically desirable traits and were positively selected during domestication and diversification of cultivated rice

With the ever-increasing global population and the current status of food resources, it is clear that rice will play a crucial role in fulfilling global demands. However, increasing quality and quantity solely through traditional crop improvement programmes will undermine the crucial contribution that this cereal can deliver in this lingering crisis. There are also additional factors associated with the breeding of sustainable crops that cannot be ignored (summarised in Fig. 1). With population growth, constant urbanisation and increasing misuse of land, there is an evident dearth of cultivable land. Excessive use of resources such as fertilisers, pesticides and water has had adverse effects on the environment. Increasing rice production by developing high-yielding varieties therefore necessitates incorporating other attributes such as reducing dependency on fertilisers and pesticides, increasing efficiency of water and nutrient use, enhancing tolerance to biotic and abiotic stresses (e.g. disease, drought, salinity, temperature), being able to grow on marginal lands and reducing emissions of greenhouse gases (Zhang 2007; Roychoudhury et al. 2013). Other qualities that will contribute to the better acceptability of rice as a staple diet include improving plant architecture (e.g. number of panicles, number of grains per panicle, grain weight, more productive tillers, spikelet number, uniformly mature kernels, free of empty or half-filled grains); manipulating flowering time, spikelet fertility or heading date; and improving the appearance (uniform size and shape, colour, translucency), milling (high milling recovery, high head rice recovery, no discoloration), cooking (amylose content, gel length and gelling temperature) and nutritional (both micro- and macronutrients) quality of the rice grain (Zhang 2007). To achieve this, scientists and researchers have, over the years, had to leverage genomics to obtain a thorough understanding of the genetic components that regulate the abovementioned attributes. This, in turn, had to be preceded by deciphering the entire rice genome sequence.

Fig. 1 Expected deliverables from a typical present-day crop improvement programme that aims for breeding of sustainable rice crop

Rice was the first cereal crop, the first grass species and the first monocot to have its genome sequenced. Rice was an attractive target for genome sequence analysis because it has a relatively small genome (430 Mb, 2n = 24) compared to that of other common cereals like sorghum (750 Mb), maize (3000 Mb), barley (5000 Mb) and wheat (16,000 Mb). Cereal genomes are highly conserved and exhibit considerable synteny among the genes (Freeling 2001). Rice, with its small genome size, was predicted to have a higher gene density than the other candidates. Moreover, rice has an extensive germplasm of wild and domesticated species. Also, once genetic markers were identified, they had to be manipulated to yield the desirable phenotypes, and rice was readily amenable to genetic manipulation. Therefore, rice emerged as an excellent model for plant genomics, following Arabidopsis.

The following sections aim to highlight how the International Rice Genome Sequencing Project (IRGSP) was conceived, as well as its contribution to and impact on the present-day understanding of rice genomics. More recently, resequencing efforts using high-throughput platforms, and the mapping of these data onto high-quality reference genome sequences, are helping to identify novel targets for genomic breeding and are making the development of Green Super Rice (GSR) varieties scientifically and technically feasible.

2 Mapping the Rice Genome

The profound importance of rice as a food crop and as a model plant, by virtue of its syntenic relationships with other cereal crops, has significantly stimulated rice genome analysis. Genome analysis in rice comprised the following goals: (1) mapping the rice genome, (2) sequencing the rice genome and (3) functional association of the mapped genes. Large volumes of structural and functional data on the rice genome were generated in the twentieth century, which facilitated mapping of the rice genome. This created a strong foundation for subsequent sequencing of the rice genome, which became possible with major advancements in sequencing strategies. Initial efforts included cytogenetic mapping, linkage mapping, genetic mapping and physical mapping, which were utilised extensively to build a sequence-ready physical map of the rice genome (Tyagi et al. 2004).

  1. Cytogenetic maps were conventionally based on microscopic examination of chromosomes. With the advent of novel effective techniques like fluorescence in situ hybridisation (FISH) and genomic in situ hybridisation (GISH), an enhanced-quality rice chromosome map could be generated (Heng et al. 1997). GISH was used extensively to dissect the genomic constitution of rice somatic hybrids and tetraploid species, as this technique could efficiently identify a chromosome complement belonging to a particular genome (Fukui et al. 1997). FISH, in contrast, allowed mapping of restriction fragment length polymorphism (RFLP) markers, rDNA loci, bacterial artificial chromosome (BAC) and yeast artificial chromosome (YAC) clones on the respective rice chromosomes (Jiang et al. 1995).

  2. Khush and Brar proposed 12 linkage groups, corresponding to the haploid number of chromosomes in rice (Khush and Brar 2001). These linkage groups were associated with the rice chromosomes by primary trisomics and reciprocal translocations, while centromere positions and orientation of the linkage groups were determined using secondary trisomics and telotrisomics. All these efforts culminated in a comprehensive molecular linkage map of rice.

  3. The advent of molecular genetic markers provided significant impetus to the construction of genetic maps, which were eventually supplemented with phenotypic markers. Several genetic maps of rice were constructed from RFLP, amplified fragment length polymorphism (AFLP), random amplified polymorphic DNA (RAPD), cleaved amplified polymorphic sequence (CAPS) and microsatellite or simple sequence length polymorphism (SSLP) markers (Mohan et al. 1997). Most of these markers were developed from expressed sequence tag (EST) clones derived from cDNA libraries of rice.

  4. The immense progress made in constructing the physical maps was due mainly to the availability of libraries of large insert clones in vectors like BACs, YACs and P1-derived artificial chromosomes (PACs) and to the information about DNA markers on genetic maps. The YAC clones were originally mapped onto the rice chromosomes by chromosome landing, integrating information from the genetic maps. This integrated YAC physical map was subsequently used for positional cloning of several genes (e.g. rice blast resistance gene, Pib; bacterial blight resistance gene, Xa-1; and gibberellin-insensitive dwarf mutant gene, d1) and for the assignment of chromosomal locations of more than 6000 EST markers to generate a rice transcript map, and it has also been utilised as a backbone for construction of PAC- and BAC-derived physical maps (Yoshimura et al. 1996; Ashikari et al. 1999; Wang et al. 1999; Wu et al. 2002). Tao et al. (2002) developed one large insert plant-transformation-competent BIBAC library and three BAC libraries for the temperate japonica rice cv. Nipponbare, which significantly aided functional analysis of the genome, comparative genomics of grass species and subspecies and molecular breeding in rice and other cereals.

  5. With the construction of the abovementioned maps, the next approach was to integrate all the information obtained from the cytogenetic, genetic and physical maps into a comprehensive physical-genetic map, which would cover a large fraction of the genome and facilitate the study of the organisation and functional aspects of the genome. For instance, a standardised rice karyotype was constructed from meiotic pachytene chromosomes of O. sativa ssp. japonica rice cv. Nipponbare, using centromere-specific DNA probes and chromosomal arm-specific BACs, following which it was fully integrated with the most saturated rice genetic linkage maps in which Nipponbare was used as one of the mapping parents (Cheng et al. 2001a). A physical-genetic map of rice chromosome 10 was developed by integrating the pachytene chromosome-based FISH mapping of BAC clones with a genetic linkage map, which revealed the precise genetic position of the centromere on chromosome 10 and shed light on the genetic recombination frequencies along the chromosomal arms (Cheng et al. 2001b).

2.1 Rice Genome Research Program (RGP), Japan, Established 1991

In 1991, the Japanese government initiated its Rice Genome Research Program (RGP). The aim of this programme was to determine the genome sequence of rice in its entirety (Sasaki 1998). The RGP was implemented in two phases: the first phase of 7 years dealt with the mapping of the rice genome, which included large-scale cDNA analysis, genetic mapping by construction of a fine-scale RFLP map and YAC-based physical mapping; the second phase was initiated in 1998, when RGP stepped into the new era of genome sequencing and matured into the International Rice Genome Sequencing Project (IRGSP). This section highlights the first phase, when the main motive was to construct a genome-wide physical map. This phase coincided with advances in computation, in terms of next-generation web technologies and bioinformatics analysis platforms, which became extremely important for scientists and bench biologists to visualise, annotate and analyse the high-throughput experimental data.

2.1.1 Linking Genomics and Genetics

RGP adopted cDNA analysis, a quick and easy strategy to clone several genes expressed in rice. This involved random cloning, partial sequencing and developing cDNA libraries from various tissues at different developmental stages, for example, green and etiolated seedlings, young roots, panicles at the flowering stage and calluses cultured with 2,4-dichlorophenoxyacetic acid (Yamamoto and Sasaki 1997). By 1997, 36,000 cDNA clones from 15 main cDNA libraries were sequenced for 400–500 5′-terminal bases (Sasaki 1998). Additionally, more than 40,000 cDNA clones were partially sequenced to generate ESTs. Therefore, sequencing the cDNA from these samples helped: (1) to designate ESTs for genomic regions of expressed genes, (2) to understand the expression profile of genes from various tissues at separate growing stages and (3) to functionally annotate the genes depending on similarity search in public databases. Generating ESTs was of profound importance as it was exploited for linkage analysis by utilising RFLP markers and also for determining hinge markers for YAC contigs. So to link the information in the DNA sequence to phenotypic traits, the basic tool required for rice genome analysis is a genome-wide physical-genetic map.

At the time, a high-density genetic linkage map based on polymorphisms within DNA sequences, such as RFLPs, CAPSs and simple sequence repeats (SSRs), was derived from F2 plants from a single cross between the japonica variety Nipponbare and the indica variety Kasalath (Harushima et al. 1998). It was composed of 2275 genetic markers with a cumulative genetic distance of 1550 cM across 12 linkage groups. Nearly 70% of the DNA markers were rice ESTs (clones from Nipponbare callus, root and shoot libraries), while the remainder were clones from genomic DNA and cDNA of rice and other cereal crops. The position of the centromere, assigned using the secondary trisomics and telotrisomics developed at the International Rice Research Institute (Philippines), revealed the chromosomal orientation of individual linkage groups and suggested that the meiotic recombination frequency is very low at each centromere but reasonably high in the flanking regions (Singh et al. 1996). Additionally, precise knowledge of the positions of DNA markers on the genetic map was used for accurate genotyping of candidate progenies obtained by backcrossing for a particular trait (Yano et al. 1997).
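As a rough check on what these figures imply, the short Python sketch below derives the average marker spacing and the average physical distance per centimorgan from the numbers quoted above, taking 430 Mb as the genome size. The function and variable names are illustrative only. The results are genome-wide averages: as noted above, recombination is strongly suppressed near centromeres, so the local ratio of physical to genetic distance can differ widely from these values.

```python
# Figures quoted in the text; 430 Mb is the genome size used throughout this chapter.
N_MARKERS = 2275
MAP_LENGTH_CM = 1550.0
GENOME_SIZE_MB = 430.0


def map_summary(n_markers, map_length_cm, genome_size_mb):
    """Return genome-wide averages implied by a genetic linkage map.

    End effects (one interval fewer than markers per linkage group)
    are ignored, which is a simplifying assumption.
    """
    genome_size_kb = genome_size_mb * 1000.0
    return {
        "cm_per_marker": map_length_cm / n_markers,    # average genetic spacing
        "kb_per_cm": genome_size_kb / map_length_cm,   # average physical/genetic ratio
        "kb_per_marker": genome_size_kb / n_markers,   # average physical spacing
    }


summary = map_summary(N_MARKERS, MAP_LENGTH_CM, GENOME_SIZE_MB)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

On these assumptions the map carries, on average, one marker roughly every 0.7 cM, which corresponds to a little under 200 kb of DNA.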

A physical map assembled using YACs was also available. The YAC library was derived from the Nipponbare variety and comprised ~7000 clones with an average insert size of 350 kb (Umehara et al. 1995). Thus the library size was about 5.5 times that of the rice genome (430 Mb), implying extensive overlap among YAC clones. In addition, chimerism (noncontiguous DNA fragments present within the same YAC clone) and the difficulty of separating YACs from other yeast chromosomes posed challenges in using YACs as templates for DNA sequencing. This redundancy was resolved by the following strategies: (1) colony hybridisation of YACs using RFLP markers, (2) using sequence-tagged site markers for identifying positive YACs and (3) mapping many ESTs onto the YAC physical map, by assembling them using primers designed from the 3′ untranslated region (UTR), specific for each gene (Antonio et al. 1996; Saji et al. 1996; Shimokawa et al. 1996; Umehara et al. 1996, 1997; Wang et al. 1996; Koike et al. 1997; Tanoue et al. 1997). BAC/PAC vectors were also used extensively to construct new rice genomic libraries (Budiman et al. 1999; Baba et al. 2000). Clones were generated using several restriction enzymes (e.g. Sau3AI for PACs and HindIII and EcoRI for BACs), partially sequenced from the 5′ and 3′ termini and subsequently fingerprinted to develop a BAC physical map. Similarly, PAC contigs were identified using ESTs and confirmed by fingerprinting. Sequencing of the selected PAC contigs resulted in significant coverage of the rice genome and indicated a higher gene density than expected.
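The library redundancy quoted above can be reproduced with a short calculation. The Python sketch below computes the number of genome equivalents from the clone count and mean insert size. As an addition not taken from the source, it also applies the standard Clarke-Carbon estimate of the probability that any given locus is present in at least one clone, which assumes clones are sampled at random from the genome. With the rounded inputs (~7000 clones, 350 kb) the redundancy comes out at about 5.7, close to the 5.5 stated in the text; the small difference presumably reflects rounding of the inputs.

```python
def library_coverage(n_clones, mean_insert_kb, genome_size_mb):
    """Genome equivalents represented by a large-insert clone library."""
    return n_clones * mean_insert_kb / (genome_size_mb * 1000.0)


def probability_locus_covered(n_clones, mean_insert_kb, genome_size_mb):
    """Clarke-Carbon estimate: P = 1 - (1 - f)^N, with f = insert size / genome size.

    Assumes clones are drawn uniformly at random, which real libraries
    only approximate (cloning bias, chimeric clones).
    """
    fraction = mean_insert_kb / (genome_size_mb * 1000.0)
    return 1.0 - (1.0 - fraction) ** n_clones


# Rounded figures from the text: ~7000 YAC clones, 350 kb inserts, 430 Mb genome.
coverage = library_coverage(7000, 350.0, 430.0)
p_covered = probability_locus_covered(7000, 350.0, 430.0)
print(f"genome equivalents: {coverage:.2f}")
print(f"probability a locus is represented: {p_covered:.4f}")
```

A redundancy of this order means that almost every locus is expected to be present in the library, which is why overlapping clones could be ordered into contigs.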

The combination of ESTs obtained from extensive cDNA analysis, DNA markers identified on RFLP linkage maps and the information derived from the physical maps generated from YAC clones and BAC/PAC libraries indicated to a large extent the exact location of the expressed genomic region. A properly annotated genome-wide physical-genetic map was the need of the hour, such that each locus, supplemented with positional information on the chromosomes, could be associated with a phenotypic trait, thereby facilitating the cloning of biologically and agronomically important genes. RGP was a prerequisite for generating a reliable sequence-ready physical map. These efforts not only prepared the foundation for rice genome sequencing, but triggered and inspired the entire sequencing project that was to be undertaken in the subsequent years worldwide (Sasaki 1998; Sasaki and Burr 2000).

2.1.2 Genomics Meets Informatics

The sequences generated by these concerted efforts were subsequently annotated by searching against databases with various software and prediction tools. The following approaches were adopted to generate a final annotation of genes and associated elements, along with assigning their proper coordinates in the genome sequence: (a) BLASTX to search the nonredundant protein database, (b) BLASTN to search the rice EST database, (c) GENSCAN to predict open reading frames (ORFs) within the sequence, (d) SplicePredictor to identify potential exon-intron splice sites, (e) Miropeats to predict inverted or tandem repeats and assemble the shotgun sequence and (f) using gag and pol genes as references to recognise transposable elements, followed by identification of long terminal repeats (LTRs) (Parsons 1995; Kleffe et al. 1996; Altschul et al. 1997; Burge and Karlin 1997). Other tools such as Gene Finder, GeneMark and NetPlantGene were also employed (Hebsgaard et al. 1996; Solovyev and Salamov 1997; Lukashin and Borodovsky 1998). Even after integration and correlation of all this genetic information with the genome sequence, a powerful information retrieval system was necessary to easily access and manipulate the data and link this information to other genetic resources.
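To make the notion of an ORF concrete, the following Python sketch scans the three forward reading frames of a DNA sequence for stretches that begin with ATG and end at the first in-frame stop codon. It is a deliberately naive illustration and not a reimplementation of GENSCAN or any of the tools listed above: those programs rely on probabilistic gene models and account for introns, whereas this sketch ignores splicing and the reverse strand. The function name and the `min_codons` threshold are assumptions chosen for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here is an ATG followed in-frame by the first stop codon;
    `end` is exclusive and includes the stop codon. Coordinates are 0-based.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


# Toy sequence: a six-codon ORF (including the stop) starting at position 2.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

In practice, ORF candidates of this kind would only be a starting point; the similarity searches (BLASTX, BLASTN) and splice-site predictions described above supply the evidence needed to turn them into gene models.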

A new rice genome database, called INtegrated rice genome Explorer (INE, pronounced ‘i-ne’, which also means rice plant in Japanese), was developed to accelerate the release of all this genetic information to the public (Sakata et al. 2000). This web interface, based on a Java applet, allowed rapid viewing of the integrated maps, accompanied by smooth navigation options. INE incorporated and integrated large data volumes from various experimental outputs to enhance the overall density of the markers within the genome. The high-density linkage map with 2275 DNA markers also included the image of each probe used for RFLP and the sequence of the corresponding markers (Harushima et al. 1998). The physical map generated using the YAC clones covered a significant portion of the genome and also showed the actual physical distance between genetic markers. Incorporation of ESTs in INE allowed review of syntenic relationships and also enhanced the significance of high-density markers. A physical map constructed from sequenced PAC clones was also incorporated to enhance the reliability of the data. Additional quality control was applied to ensure good-quality data: INE provided a page for accessing ‘Low Quality Information’, which harboured sequences that failed to meet the standards or thresholds set by RGP. INE not only increased the extent of its data incorporation but also allowed an integrated view of the data for efficient data mining. Each chromosome was associated with an integrated map which showed the linkage map, the physical map constructed from ordered YAC clones, the EST map and the PAC contigs. The value of the genetic information was therefore greatly enhanced by this integrated display. The following examples illustrate the usefulness of the integrated map: by integrating the PAC contig map or an EST map into the linkage map, (1) the existence of a PAC or EST adjacent to the genetic marker of interest was verified, (2) the exact positions of genetic markers and other genetic elements were evaluated, and (3) functional annotation of a PAC near a genetic marker revealed an EST within a PAC contig responsible for a particular trait. These would facilitate subsequent map-based cloning of agronomically desirable genes. INE was later modified to meet the demands of the genome sequencing project (Sasaki and Burr 2000). Upon further elucidation of signal transduction pathways in rice, more valuable information on the physiological and biochemical aspects of rice genes was incorporated into INE. For exploration of syntenic relationships of rice with other cereal crops, INE was subsequently linked to other genomic resources of important cereal crop species.
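The kind of query that an integrated map makes possible, such as finding the PACs and ESTs adjacent to a genetic marker of interest, can be sketched with a simple interval lookup. The Python example below is a conceptual illustration only and does not reflect how INE itself was implemented; the `Feature` class, the feature names and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    kind: str       # "marker", "EST", "PAC" or "YAC"
    start_kb: int   # position on a common physical coordinate system
    end_kb: int


def overlapping(features, start_kb, end_kb, kinds=None):
    """Features of the requested kinds that overlap [start_kb, end_kb]."""
    hits = [
        f for f in features
        if f.start_kb <= end_kb and f.end_kb >= start_kb
        and (kinds is None or f.kind in kinds)
    ]
    return sorted(hits, key=lambda f: f.start_kb)


def neighbours(features, marker_name, flank_kb=100):
    """PACs and ESTs lying within flank_kb of a named genetic marker."""
    marker = next(
        f for f in features if f.kind == "marker" and f.name == marker_name
    )
    return overlapping(
        features,
        marker.start_kb - flank_kb,
        marker.end_kb + flank_kb,
        kinds={"EST", "PAC"},
    )


# Hypothetical features placed on one chromosome (coordinates in kb).
chromosome_map = [
    Feature("M1", "marker", 500, 500),
    Feature("PAC_A", "PAC", 420, 560),
    Feature("PAC_B", "PAC", 900, 1040),
    Feature("EST_1", "EST", 530, 531),
    Feature("EST_2", "EST", 950, 951),
    Feature("YAC_1", "YAC", 300, 650),
]

for feature in neighbours(chromosome_map, "M1", flank_kb=50):
    print(feature.kind, feature.name, feature.start_kb, feature.end_kb)
```

Placing every data type on a shared coordinate system is the design choice that matters here: once linkage markers, ESTs and clone contigs share positions, adjacency questions reduce to interval overlap.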

3 Sequencing the Rice Genome

The genome-wide physical-genetic maps could leverage genomics if the nucleotide sequence of the rice genome was available. In fact, the DNA sequence can bridge the gap between the structural organisation of the genome and its functional dimensions. Towards the end of the twentieth century, with the progress in sequencing methodologies, adoption of automation and advances in computation, several genome sequencing projects of higher organisms were undertaken, including Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana and Homo sapiens (Mewes et al. 1997; The C. elegans Sequencing Consortium 1998; Adams et al. 2000; The Arabidopsis Genome Initiative 2000; International Human Genome Sequencing Consortium 2001; Venter et al. 2001). In that new era of sequence-based genomic research, the obvious choice for a genome sequencing project on a cereal crop was rice. By then the rice genome was well mapped because of the decade-long extensive research on rice genomics, thereby providing a solid foundation for the upcoming sequencing project. Among the major cereal crop genomes, rice has the smallest genome of 430 Mb. Moreover, rice serves as an excellent model as it shares a large extent of synteny with other grass genomes, like rye, wheat, maize, barley and sorghum, and can be easily transformed and genetically manipulated (Tyagi and Mohanty 2000). With the establishment of the International Rice Genome Sequencing Project, in the second phase of RGP, rice became the first crop and the first monocot genome to be sequenced. The following section discusses the inception of IRGSP, inclusion of the participating nations, adoption of optimal methodologies for sequencing, choice of rice cultivar, nucleotide sequence accuracy, finishing standards and sequence release policies.

3.1 International Rice Genome Sequencing Project (IRGSP), Established 1997

The International Rice Genome Sequencing Project (IRGSP) was initiated at a workshop in September 1997, at the International Symposium on Plant Molecular Biology held in Singapore. Scientists and representatives from several nations attended the workshop and agreed to collaborate internationally to sequence the rice genome (Eckardt 2000). To avoid allelic polymorphism, a single rice cultivar was used as the common source of DNA: Oryza sativa ssp. japonica cultivar Nipponbare. Participants from Japan, Korea, China, the United Kingdom and the United States decided to share and contribute materials and agreed to release physical maps and DNA sequences to the public databases in a timely manner. The IRGSP eventually grew to include 11 nations (listed in Table 2) and also revised its policies and finishing standards to ensure a high-quality sequence-ready genome map of rice. The modus operandi of this project was similar to that of other large-scale genome sequencing projects accomplished at the time (Sasaki and Burr 2000; International Rice Genome Sequencing Project and Sasaki 2005). Large insert genomic libraries constructed in BACs and PACs were used as the primary sequence templates. Using polymerase chain reaction (PCR) screening, fingerprinting, end-sequencing and physical mapping, minimal tiling paths were constructed such that a contiguous set of minimally overlapping BAC/PAC clones could be anchored to physical positions along the length of the chromosome, thereby generating a sequence-ready BAC/PAC contig. The clones belonging to a contig were subjected to a map-based hierarchical clone-by-clone shotgun approach to produce shotgun libraries, which were then sequenced and assembled using in silico tools to reconstitute the entire intact insert within the corresponding BAC/PAC clone (see Fig. 2). The second focus of this project was annotating the raw nucleotide sequence data and associating useful biological information with it, for example, positioning of genes; prediction of splice sites, transcriptional start sites and regulatory sites; repeat elements; prediction of tRNAs; and analyses of the predicted proteins to dissect any functional domains or motifs. This mammoth task, inclusive of so many dimensions, finally culminated in a high-quality genome sequence of rice. The following sections discuss the methodologies adopted by IRGSP and some significant outcomes of the rice genome analysis.
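The idea of a minimal tiling path can be illustrated as an interval-covering problem: given clones already anchored to positions along a chromosome, pick the smallest set whose members overlap end to end across the region. The Python sketch below uses a standard greedy strategy (always extend with the clone that reaches furthest) and is offered as a conceptual illustration, not as the procedure IRGSP actually used, which depended on fingerprint, marker and end-sequence evidence. Clone names and coordinates are hypothetical.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy selection of a minimal set of overlapping clones.

    `clones` is a list of (name, start, end) tuples in any order.
    Raises ValueError if the clones leave a gap in the region.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting at or before the covered position,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical BAC/PAC clones with coordinates in kb.
clones = [
    ("A", 0, 150),
    ("B", 100, 260),
    ("C", 120, 300),
    ("D", 250, 400),
    ("E", 280, 450),
]
print(minimal_tiling_path(clones, 0, 450))
```

A real tiling path would additionally require a minimum verified overlap between neighbouring clones; the gap error in the sketch corresponds to the physical gaps that had to be bridged by other means.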

Table 2 Chromosome assignments for sequencing for IRGSP participants

Fig. 2 Schematic workflow of two distinct shotgun sequencing approaches used in rice genome sequencing: hierarchical clone-by-clone approach was adopted by IRGSP and Monsanto (left), while whole-genome shotgun approach was employed by Syngenta (right)

3.1.1 Physical Map and Sequencing

The hierarchical clone-by-clone methodology adopted by IRGSP used a high-density genetic linkage map generated using a single F2 population (2275 markers), ESTs, YAC- and BAC-based physical maps and two draft sequences from two separate private sources, Monsanto (now Pharmacia, New Jersey) and Syngenta (San Diego) (Yamamoto and Sasaki 1997; Harushima et al. 1998; Mao et al. 2000; Barry 2001; Saji et al. 2001; Chen et al. 2002; Goff et al. 2002; Wu et al. 2002). In total, nine genomic libraries from Oryza sativa ssp. japonica cultivar Nipponbare were utilised to establish the physical map (International Rice Genome Sequencing Project and Sasaki 2005). Marker-aided PCR screening, fingerprinting with restriction enzymes and end-sequencing of approximately 3400 BAC/PAC clones were used to construct the minimal tiling paths. The majority of the physical gaps in the BAC/PAC tiling path were filled with the help of PCR fragments, 10-kb plasmids and 40-kb fosmid clones. A typical workflow with BAC clones included the following steps: (1) fingerprinting of HindIII and EcoRI BAC libraries, (2) assembly of the fingerprinted BACs into contigs, (3) anchorage of these BACs onto the physical map with RFLP and end-sequencing analysis and, finally, (4) connection and extension of contigs by chromosome walking. Another technique adopted for extending contigs was the use of ‘overgo probes’ (Eckardt 2000). Two 24-bp sequences with an 8-bp overlap were designed from BAC end-sequences. These were then joined to create a 40-bp ‘overgo’, which was used to probe a high-density BAC library in search of additional BAC clones that could potentially extend a contig. Using similar workflows, plant ESTs were also mapped onto the physical map of rice chromosomes. The PAC, BAC and fosmid clones on the physical map were next subjected to a shotgun sequencing approach, originally employed by RGP.
In this procedure, the extracted DNA from individual PAC/BAC clones (ranging between 100 and 200 kb) belonging to a sequence-ready contig was subjected to random shearing by sonication or nebulisation, following which the fragments were subcloned to produce shotgun libraries with an average insert size of 1 to 3 kb (see Fig. 2). Random clones from the shotgun libraries were then sequenced, using both universal primers and the dye-terminator or dye-primer methods, to acquire the desired degree of ‘coverage’ of the total sequence. The shotgun sequencing approach finally led to an assembled, ordered and finished quality rice genome, with approximately tenfold sequence coverage and less than 1 error per 10,000 bases. As mentioned above, small physical gaps were bridged by long-range PCRs. FISH and optical mapping were the other two techniques that were adopted for the remaining gaps. The application of FISH to rice mapping was already well established (Jiang et al. 1995; Cheng et al. 2001b). FISH had proved to be a useful technique for easy identification of rice chromosomes, determination of physical positions of uncertain clones and examination of the physical nature of large linkage gaps. The last feature facilitated sequencing at chromosomal ends, centromere and telomere regions and turned out to be effective for characterising BAC clones that contain complex repetitive DNA sequences, which are very common in rice (Moore et al. 1995). The optical mapping technique was previously used to create whole-genome restriction maps of the microorganisms Deinococcus radiodurans and Plasmodium falciparum (Lin et al. 1999; Lai et al. 1999). In this technique, capillary fluid flow was employed to extend and align DNA molecules onto a specially prepared glass surface. DNA was then digested with restriction enzymes, and fluorescence microscopy imaging was finally used to map the fragments into an ordered array. Typically, charting a minimal tiling path (i.e. a subset of clones with minimally overlapping sequences) which encompasses a chromosome or a region of interest along a chromosome is dependent on the physical map and on the unambiguous assembly of sequences from a selected subset of clones with the help of their overlapping regions. Unfortunately, the centromeric regions of most multicellular eukaryotic organisms have proved resistant to this method, largely because they contain highly repetitive satellite DNA sequences and transposable elements (Eckardt 2004). The recalcitrance of these regions impeded the efforts to understand their functionality. Findings from previous cytological work indicated that rice centromeres contain multiple repeats of a 155-bp satellite DNA sequence called CentO and that many rice chromosomes contain limited amounts of the satellite DNA repeat sequences compared to other species (Dong et al. 1998; Cheng et al. 2002). The centromere of rice chromosome 8 (Cen8) had the least amount of satellite DNA sequence (∼64 kb) among the 12 rice chromosomes and was an obvious choice for obtaining the sequence for this region. Two groups made significant efforts towards this goal: Wu et al. (2004) employed repeated subcloning, transposon-based sequencing and fingerprinting of BAC/PAC clones, while Nagaki et al. (2004) used similar methods, mapped the Cen8 region using CentO repeat sequences and centromere-specific retrotransposon sequence probes and constructed a minimal tiling path of 12 BAC clones encompassing this region (Nagaki et al. 2004; Wu et al. 2004). The breakthrough findings reported by the two groups indicated that a large fraction of the Cen8 sequence consists of known repetitive elements, such as inverted repeats, transposons, gypsy-type retrotransposons and CentO satellite repeats. 
Therefore, the sizes of the large physical gaps arising from the centromeric regions in the sequences were estimated on the basis of the CentO satellite DNA content of the respective chromosomes (International Rice Genome Sequencing Project and Sasaki 2005). The length of CentO arrays thus provided approximate measures of centromere gaps, telomere gaps and chromosomal arms.
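The idea of a minimal tiling path described above can be illustrated with a short sketch. It assumes that each clone has already been placed on the physical map as a (name, start, end) interval and uses a standard greedy interval-cover strategy; the clone names and coordinates are hypothetical, and this is not the procedure actually used by IRGSP, which relied on fingerprinting, marker screening and end-sequencing.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover: pick a small set of overlapping clones spanning a region.

    clones: iterable of (name, start, end) tuples on a common coordinate system.
    Raises ValueError if the clones leave a physical gap in the region.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones that start inside the covered part, keep the one reaching furthest.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"physical gap at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical BAC/PAC placements (coordinates in kb).
example_clones = [
    ("clone_a", 0, 100),
    ("clone_b", 50, 120),
    ("clone_c", 90, 200),
    ("clone_d", 150, 260),
    ("clone_e", 190, 300),
]
print(minimal_tiling_path(example_clones, 0, 300))
```

In this toy case three of the five clones suffice to span the region, while a region that cannot be spanned raises an error, which corresponds to a physical gap that would have to be bridged by other means.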

3.1.2 Integration of Draft Sequences from Monsanto and Syngenta with IRGSP

In early 2000, Monsanto announced that the company was set to release a draft sequence of the rice genome. The sequences would be made available to academic scientists under an access agreement with Monsanto. A separate agreement between the company and IRGSP facilitated incorporation of the Monsanto BAC clones and sequences into the IRGSP sequences. The Monsanto rice genome sequencing project also undertook the BAC-to-BAC sequencing approach on ~3400 BAC clones of the cultivar Nipponbare (Barry 2001). The draft sequence represented 393 Mb of the rice genome, with ~5X coverage. This development significantly accelerated public efforts to sequence the genome. A draft of the rice genome sequence from ssp. japonica cultivar Nipponbare was completed and published separately by Syngenta (San Diego) in 2002 (Goff et al. 2002). This draft, generated using a whole-genome shotgun sequencing approach, represented 93% of the rice genome (~390 Mbp), with ~6X coverage, and a 44% GC content. The Syngenta draft sequence, also referred to as Syd, provided useful insights into the rice genome and proteome. Gene predictions on the assembled sequences indicated that the genome contains 32,000 to 50,000 genes, with homologs in maize, wheat, barley and other cereal crops. Extensive synteny and gene homology were also indicated between rice and other cereal genomes. Another draft sequence, that of the O. sativa subsp. indica cultivar 93–11, was released by the Beijing Genomics Institute (BGI) and was also generated through a whole-genome shotgun sequencing approach (Yu et al. 2002). This represented ~360 Mb of the genome and proved to be an important resource for gene discovery, comparative genomics, syntenic associations and SNP discovery. BGI sequence analysis suggested that the rice genome encodes 46,000–55,000 proteins, which was consistent with the estimate made from the Syd data. The BGI data were made publicly available through GenBank/DDBJ/EMBL, unlike the Monsanto and Syngenta data. 
Contigs from BGI and Syngenta whole-genome assemblies were aligned against the IRGSP Nipponbare pseudomolecules using MUMmer (Delcher et al. 1999). The number of IRGSP full-length cDNA-supported gene models that aligned with the contigs was tabulated. To estimate the location of centromeres, the 155-bp CentO consensus sequence was used as BLAST query against 93–11 and Nipponbare whole-genome shotgun contigs, and their coordinates were recorded (International Rice Genome Sequencing Project and Sasaki 2005). The CentO sequence positions on each IRGSP pseudomolecule for a particular chromosome served as the benchmark to estimate the location of centromeres for each indica chromosome.
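As a rough illustration of how recorded CentO hit coordinates can be turned into an approximate centromere interval, the sketch below parses BLAST tabular output (the standard 12-column format, in which columns 9 and 10 hold the subject start and end). The identity cut-off, the sequence names and the example hits are assumptions made for illustration only, not values taken from the IRGSP analysis.

```python
def centromere_spans(blast_tab, min_identity=90.0):
    """Return {subject_id: (lowest, highest)} coordinates covered by CentO hits.

    blast_tab: text in BLAST tabular format with the default 12 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    spans = {}
    for line in blast_tab.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        subject = fields[1]
        identity = float(fields[2])
        s_start, s_end = int(fields[8]), int(fields[9])
        if identity < min_identity:
            continue  # assumed cut-off to ignore weak, probably spurious matches
        low, high = min(s_start, s_end), max(s_start, s_end)  # minus-strand hits are reversed
        if subject in spans:
            previous_low, previous_high = spans[subject]
            spans[subject] = (min(low, previous_low), max(high, previous_high))
        else:
            spans[subject] = (low, high)
    return spans


# Hypothetical hits of the 155-bp CentO consensus against one pseudomolecule.
example_hits = "\n".join([
    "CentO\tchr08\t98.1\t155\t3\t0\t1\t155\t12500100\t12500254\t1e-70\t270",
    "CentO\tchr08\t96.8\t155\t5\t0\t1\t155\t12563900\t12563746\t1e-66\t255",
    "CentO\tchr08\t82.0\t60\t11\t0\t1\t60\t500\t559\t1e-05\t40",
])
print(centromere_spans(example_hits))
```

The interval obtained for a Nipponbare pseudomolecule could then serve as the benchmark against which hits on the corresponding indica contigs are compared.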

3.1.3 Assembly and Annotation

Fingerprinted BACs were assembled into contigs using the software called FingerPrinted Contigs (FPC) (Soderlund et al. 2000). The shotgun sequences were assembled into intact inserts within the BAC/PAC clone using PHRED, PHRAP (http://www.phrap.org/phredphrapconsed.html) and CONSED software packages or using the TIGR Assembler (Sutton et al. 1995; Ewing et al. 1998; Ewing and Green 1998; Gordon et al. 1998). The primary goal of IRGSP was to obtain a complete and accurate sequence of the rice genome. Consistent with the Bermuda sequencing standards (the international human genome sequencing community organised meetings in Bermuda in 1996 and 1997 to set finishing standards for DNA sequences), ‘complete’ was initially defined as less than 1 error in 10,000 bases (Eckardt 2000; International Rice Genome Sequencing Project and Sasaki 2005). An empirically determined PHRAP score of 30 or above was found to be consistent with the level of accuracy. Sequence ambiguities, therefore, were indicated by PHRAP scores less than 30 and were usually regions containing repetitive DNA sequences. These problem regions were resolved by changing sequencing chemistries or using alternate polymerases. Assemblies of BAC and PAC clones were checked for accuracy by comparing the experimentally determined fingerprint patterns with computationally derived patterns of restriction enzyme digests. Sequence quality was also tested by comparisons of overlapping sequences obtained independently.
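The comparison between experimental fingerprints and computationally derived digests can be sketched as follows. The recognition sites and cut positions of HindIII (A^AGCTT) and EcoRI (G^AATTC) are standard; treating the clone insert as a linear sequence, the 5% size tolerance and the example sequence are simplifying assumptions for illustration.

```python
# Recognition site and cut offset within the site (both enzymes cut after the first base).
SITES = {"HindIII": ("AAGCTT", 1), "EcoRI": ("GAATTC", 1)}


def digest(sequence, enzyme):
    """Return fragment lengths (largest first) from an in silico digest of a linear sequence."""
    site, cut_offset = SITES[enzyme]
    sequence = sequence.upper()
    cuts = []
    position = sequence.find(site)
    while position != -1:
        cuts.append(position + cut_offset)
        position = sequence.find(site, position + 1)
    bounds = [0] + cuts + [len(sequence)]
    return sorted((end - start for start, end in zip(bounds, bounds[1:])), reverse=True)


def fingerprints_agree(observed, predicted, tolerance=0.05):
    """Compare observed band sizes with predicted fragment sizes within a relative tolerance."""
    if len(observed) != len(predicted):
        return False
    pairs = zip(sorted(observed, reverse=True), sorted(predicted, reverse=True))
    return all(abs(obs - pred) <= tolerance * pred for obs, pred in pairs)


# Hypothetical assembled insert with two HindIII sites.
assembled = "CCCC" + "AAGCTT" + "GGGGGGGG" + "AAGCTT" + "TT"
predicted = digest(assembled, "HindIII")
print(predicted)
print(fingerprints_agree([14, 7, 5], predicted))
```

A mismatch in the number or size of fragments would flag a clone whose assembly needs to be re-examined.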

Another important step in the rice genome sequencing project was the construction of pseudomolecules, which were generated by using an in-house, database-based, semiautomated process (http://rice.plantbiology.msu.edu/). These are virtual contigs that are constructed by selecting an optimal tiling path of BAC/PAC clones representing all 12 chromosomes and by resolving discrepancies between overlapping BAC/PAC clones. Overlap regions were trimmed in accordance with information from similarity searches and tiling path information, while physical gaps were represented by a stretch of 1000 nucleotides. The overlapping clones were also validated by comparing the pseudomolecule sequence with the optical map for rice. The unique sequences were then linked to form a contiguous sequence for each rice chromosome. Location of centromeres was ascertained on the basis of gaps and presence of the CentO repeats. Information on gene models and features associated with the BAC/PAC clones was transferred to the pseudomolecules, which underwent quality control using the whole-genome shotgun sequences of Nipponbare. Not all BAC/PAC clones were incorporated into the pseudomolecules, for various reasons: some belonged to subspecies indica, some clones were unfinished, redundant or overlapping, some were released after the corresponding version of the pseudomolecules, while some were difficult to map to their correct positions. Sequences from BACs and the Syngenta whole-genome shotgun assembly of Nipponbare that could not be aligned on the existing pseudomolecules were represented on separate pseudomolecules, the Unanchored and the Syngenta chromosomes. Release of this assembly of pseudomolecules, termed Os-Nipponbare-Reference-IRGSP-1.0, was made in collaboration with the Agrogenomics Research Center at the National Institute of Agrobiological Sciences, Tsukuba, Japan. 
The annotation files for the pseudomolecules exclude partial genes, pseudogenes and small gene models (<50 amino acids).
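The joining of tiling-path clones into a pseudomolecule can be pictured with a minimal sketch. It assumes that the overlap between consecutive clones is already known as a number of bases and that each physical gap is written as a run of 1000 'N' characters; the use of 'N' as the placeholder base and the toy sequences are assumptions for illustration.

```python
GAP = "N" * 1000  # a physical gap represented by a stretch of 1000 nucleotides


def build_pseudomolecule(tiles):
    """Join ordered clone sequences into one contiguous sequence.

    tiles: list of (sequence, overlap) pairs in tiling-path order, where overlap is
    the number of leading bases shared with the previous clone, or None when a
    physical gap separates the clone from the previous one.
    """
    parts = []
    for index, (sequence, overlap) in enumerate(tiles):
        if index == 0:
            parts.append(sequence)
        elif overlap is None:
            parts.append(GAP)
            parts.append(sequence)
        else:
            parts.append(sequence[overlap:])  # trim the redundant overlap region
    return "".join(parts)


# Hypothetical toy clones: the second overlaps the first by 4 bases, the third follows a gap.
toy_tiles = [("ACGTACGT", 0), ("ACGTTTTT", 4), ("GGGG", None)]
pseudomolecule = build_pseudomolecule(toy_tiles)
print(len(pseudomolecule))
```

Annotation coordinates from the individual clones would then need to be shifted by the offset of each clone within the joined sequence.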

Many bioinformatics resources were used for sequence annotation of the rice genome. Several gene prediction programmes, such as GENSCAN, FGENESH, RiceHMM, GlimmerM and MZEF, were employed to predict gene models (Burge and Karlin 1997; Zang 1997; Sakata et al. 1999; Salamov and Solovyev 2000; Pertea and Salzberg 2002). Monocot-trained versions of the tools were used for accuracy. Gene models that corresponded to organellar DNA, those with incomplete open reading frames and those encoding proteins of less than 50 amino acids were omitted. The SplicePredictor programme was employed to ascertain the exon-intron boundaries (Kleffe et al. 1996). Another automated annotation system, RiceGAAS (the Rice Genome Automated Annotation System), was also used extensively to predict genes and long terminal repeat sequences on the basis of homology search (Sakata et al. 2002). The predicted proteome was analysed, using various tools from RiceGAAS, for the presence of functional domains (ProfileScan and MOTIF), intracellular localisation (PSORT) and solubility (SOSui). Searches for conserved domains and motifs and the assignment of gene ontology (GO) associations were completed using InterproScan in combination with the Interpro2Go programme (Zdobnov and Apweiler 2001; Camon et al. 2005). The predicted proteome of rice was also searched using BLASTP against the proteomes of several model species, including Arabidopsis, to identify homologs that could shed light on the potential functions of the genes. MUMmer and RepeatMasker software were used to delineate repetitive sequences in the rice genome (Delcher et al. 1999; Smit et al. 1996–2000). The Simple Sequence Repeat Identification Tool was used to identify SSR motifs and estimate the copy number of SSR markers (Temnykh et al. 2001). The TIGR Oryza Repeat Database, together with other sequence data on rice transposable elements, was used to create a rice transposable element database (RTEdb) (Juretic et al. 2004). 
The coordinates of these transposable elements were determined on the rice pseudomolecules. Organellar DNA sequences were also used to mask pseudomolecules, using BLASTN and MUMmer. Prediction of noncoding RNAs was also addressed. Prediction of transfer RNA genes was executed by the programme tRNA-scanSE (Lowe and Eddy 1997). miRNAs, spliceosomal and snoRNA sequences were retrieved from the Rfam database (http://rfam.xfam.org/) and used as BLASTN queries. In a different approach, experimentally validated miRNAs from other species, excluding Arabidopsis, were used for BLASTN queries against the pseudomolecules. To identify single nucleotide polymorphisms in the Nipponbare cultivar, end-sequencing of BAC clones from an O. sativa ssp. indica var. Kasalath BAC library was carried out, and the sequences were subjected to BLASTN analysis against the pseudomolecules (International Rice Genome Sequencing Project and Sasaki 2005).
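The exclusion criteria applied to predicted gene models (organellar origin, incomplete open reading frames and proteins shorter than 50 amino acids) can be expressed as a simple filter. This is a minimal sketch under stated assumptions: a complete ORF is taken to mean a coding sequence that starts with ATG, ends with a stop codon, has a length divisible by three and contains no internal stop, and the organellar flag is assumed to have been determined beforehand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_complete_orf(cds):
    """True if the coding sequence runs from ATG to a single terminal stop codon."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    no_internal_stop = not any(codon in STOP_CODONS for codon in codons[:-1])
    return codons[0] == "ATG" and codons[-1] in STOP_CODONS and no_internal_stop


def keep_gene_model(cds, is_organellar=False, min_protein_length=50):
    """Apply the three exclusion criteria to one predicted gene model."""
    if is_organellar or not has_complete_orf(cds):
        return False
    protein_length = len(cds) // 3 - 1  # the stop codon is not translated
    return protein_length >= min_protein_length


# Hypothetical coding sequences.
short_model = "ATG" + "GCT" * 10 + "TAA"    # 11 amino acids
long_model = "ATG" + "GCT" * 49 + "TAA"     # 50 amino acids
partial_model = "GCT" * 60                  # no start or stop codon
print([keep_gene_model(model) for model in (short_model, long_model, partial_model)])
```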

3.1.4 Salient Features of the Rice Genome Reported by IRGSP

In 2005, the IRGSP presented the map-based, finished quality sequence of the rice genome that at the time included virtually all the euchromatin and two complete centromeres (International Rice Genome Sequencing Project and Sasaki 2005). The estimated size of the rice nuclear genome was 388.8 Mb (~389 Mb). The sequence was used to construct pseudomolecules representing the 12 rice chromosomes (haploid number), which were expected to cover 95.3% of the entire genome and 98.9% of the euchromatin. In a separate assessment, 8391 of 8440 unique EST markers (99.4%) could be identified in these pseudomolecules. The centromeres were fully delineated for two chromosomes, 4 and 8. As discussed previously, all rice centromeres contain the repetitive CentO satellite DNA sequences, along with flanking centromere-specific retrotransposons (Dong et al. 1998; Cheng et al. 2002). Complete sequencing of the centromeric regions of chromosomes 4 and 8 indicated that these chromosomes contain ~59 kb and ~69 kb of clustered CentO repeats, respectively, arranged in tandem head-to-tail arrays within the clusters (Nagaki et al. 2004; Wu et al. 2004; Zhang et al. 2004). These CentO clusters varied in length and orientation between the two chromosomes and were frequently associated with centromere-specific transposable elements, like the retrotransposon RIRE7.

To determine the gene content, distribution and density in the genome, the pseudomolecules were masked for repetitive sequences, and the gene prediction tools were employed to identify potential non-transposable element-related genes. At the time, a total of 37,544 non-transposable element-related protein-coding sequences were predicted, of which 22,840 genes could be correlated with ESTs or full-length cDNA models (International Rice Genome Sequencing Project and Sasaki 2005). A relatively low gene density of one gene per 9.9 kb was predicted in rice. Homolog search in cereals and dicots like Arabidopsis indicated a total of 2859 genes unique to rice and other cereals, which provided useful cues for differentiating between monocots and dicots. While the majority of these genes encode hypothetical proteins or proteins of unknown function, they also include families of seed storage proteins called prolamins, hormone response proteins and defence proteins, such as proteinase inhibitors, chitinases, pathogenesis-related proteins and seed allergens, which are absent in dicots. With improvements in gene annotation methods, however, the number of true protein-coding genes in rice has since been revised. Functional classification of 2296 rice genes according to agronomically desirable traits indicated the following percentage of genes in rice to have distinct functional roles (Wing et al. 2018):
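As a quick consistency check on the reported gene density, the arithmetic below relates the gene count to the sequence length. It assumes that the figure of 9.9 kb per gene refers to the sequence covered by the pseudomolecules (95.3% of the 388.8 Mb genome, as reported above) rather than to the whole estimated genome; this reading is an assumption, not a statement from the source.

```python
def kb_per_gene(sequence_length_mb, gene_count):
    """Average spacing between genes, in kb per gene."""
    return sequence_length_mb * 1000 / gene_count


GENOME_SIZE_MB = 388.8            # estimated nuclear genome size
PSEUDOMOLECULE_FRACTION = 0.953   # share of the genome covered by the pseudomolecules
PREDICTED_GENES = 37_544          # non-transposable element-related gene models

pseudomolecule_mb = GENOME_SIZE_MB * PSEUDOMOLECULE_FRACTION
print(f"{kb_per_gene(pseudomolecule_mb, PREDICTED_GENES):.1f} kb per gene over the pseudomolecules")
print(f"{kb_per_gene(GENOME_SIZE_MB, PREDICTED_GENES):.1f} kb per gene over the estimated genome")
```

Under this assumption the first value agrees with the reported density, whereas dividing the full genome estimate by the gene count gives a slightly larger spacing.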

Stress response 12.2% (367), disease resistance 7.4% (221), insect resistance 1.0% (31), growth and development 17.1% (513), phytohormone 15.8% (472), flower organ and heading date 9.2% (276), fertility 5.8% (174), yield 6.3% (189), grain quality 2.1% (63), domestication 1.0% (28), nutrient-use efficiency 6.9% (207) and others 15.2% (455).

Functional analysis of genes and relating them to phenotypes is central to rice research, and gene disruption with transposable elements remains in common use. Tos17 is an endogenous copia-like retrotransposon in rice, which prefers to insert into gene-rich regions, making it a suitable tool for functional analysis of the rice genes by gene knockout (Hirochika et al. 1996). A total of 11,487 Tos17 retrotransposon insertion sites were identified within the rice genome at the time, of which 3243 were in genes, suggesting that the density of Tos17 insertions is higher in euchromatic regions, in contrast to other retrotransposons, which were frequently found in the paracentromeric regions (International Rice Genome Sequencing Project and Sasaki 2005). Another interesting outcome of the annotation was that a large percentage of genes were arranged in tandem repeats, for example, the genes encoding protein kinase domains. Predictions of various types of noncoding RNAs distributed across the rice pseudomolecules indicated a total of 763 transfer RNA genes, 158 miRNAs, 215 small nucleolar RNAs (snoRNA) and 93 spliceosomal RNA genes. rDNA loci could also be ascertained on a number of chromosomes, including the nucleolar organiser at the telomeric end of the short arm of chromosome 9.

To accommodate the original endosymbionts inside the cells as organelles, a continuous transfer of organellar DNA to the nucleus has taken place in the course of evolution, diminishing the organellar genomes to their present-day sizes. Depending on the parameters used during homology searches, IRGSP detected 421–453 chloroplast insertions and 909–1191 mitochondrial insertions in the rice genome, and on the basis of their distribution across the 12 chromosomes, it was suggested that mitochondrial and chloroplast transfers occurred independently (International Rice Genome Sequencing Project and Sasaki 2005). The rice genome was also found to be populated with transposable elements, belonging to Class I, Class II or neither of the two classes. The transposon content of rice was predicted to be at least 35%, with elements belonging to all known transposon superfamilies. The Class I elements (e.g. long terminal repeats (LTRs) and non-LTR retrotransposons) were fewer in number compared to Class II elements (e.g. hAT, CACTA, IS256/Mutator, IS5/Tourist and IS630/Tc1/mariner superfamilies), and the majority of Class I elements were found to be concentrated in heterochromatic regions such as the centromeric and pericentromeric regions. Class 1 SSRs, which behave as hypervariable loci, were also found in the rice genome. These repeats of >20 nucleotides in length have immense application in genetics and breeding and have been shown to be polymorphic in diverse cultivars of rice (McCouch et al. 2002). A total of 18,828 Class 1 SSRs (di-, tri- and tetra-nucleotide), representing 47 distinct motif families, were identified and annotated on the rice genome. Finally, intraspecific sequence polymorphisms were detected between the two cultivated rice subspecies, japonica and indica. Comparison of orthologous sequences between O. sativa ssp. indica cv. Kasalath and O. sativa ssp. japonica cv. Nipponbare identified a total of 80,127 polymorphic sites, resulting in a high-resolution genetic map for rice. Insertions and deletions were also detected (International Rice Genome Sequencing Project and Sasaki 2005; Yu et al. 2002).
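The notion of a Class 1 SSR mentioned above (a di-, tri- or tetra-nucleotide repeat longer than 20 nucleotides) can be made concrete with a small scanning routine. This is a minimal sketch that considers perfect repeats on one strand only and is not the Simple Sequence Repeat Identification Tool; the length threshold is a parameter, and the example sequence is hypothetical.

```python
def find_class1_ssrs(sequence, longer_than=20):
    """Return (start, motif, length) for perfect di-, tri- and tetra-nucleotide repeats.

    Only whole repeat units are counted, and only repeats whose total length
    exceeds `longer_than` nucleotides are reported.
    """
    sequence = sequence.upper()
    n = len(sequence)
    hits = []
    for k in (2, 3, 4):  # motif length
        i = 0
        while i + k <= n:
            j = i + k
            # Extend while the sequence keeps repeating with period k.
            while j < n and sequence[j] == sequence[j - k]:
                j += 1
            motif = sequence[i:i + k]
            length = (j - i) // k * k
            # Skip homopolymers and tetranucleotides that are really dinucleotide repeats.
            reducible = len(set(motif)) == 1 or (k == 4 and motif[:2] == motif[2:])
            if length > longer_than and not reducible and set(motif) <= set("ACGT"):
                hits.append((i, motif, length))
            i = j - k + 1  # a new repeat run cannot start before this position
    return sorted(hits)


# Hypothetical sequence: (GA)12 qualifies (24 nt), (AAT)5 does not (15 nt).
example_sequence = "TTG" + "GA" * 12 + "CCT" + "AAT" * 5 + "G"
print(find_class1_ssrs(example_sequence))
```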

Genome-wide comparisons of the two published whole-genome shotgun assemblies of draft-quality rice sequences with the IRGSP pseudomolecules were also carried out. These assemblies predicted genome sizes of 433 Mb for O. sativa ssp. japonica cv. Nipponbare (6X coverage) and 466 Mb for O. sativa ssp. indica cv. 93–11 (6.28X coverage), which differed from the 389 Mb genome size reported by IRGSP. A detailed comparison of the sequences indicated that a substantial percentage of the contigs in both assemblies were misaligned or nonhomologous or provided duplicate coverage, suggesting that the draft sequences were inadequate for gene annotation or functional genomics.

3.2 Updates on the Current Rice Genome Data

The Michigan State University Rice Genome Annotation Project Database and Resource (MSU RGAP) is a project funded by the National Science Foundation (NSF). It provides sequence and annotation data for the rice genome, based on the IRGSP sequence of Oryza sativa ssp. japonica cv. Nipponbare (International Rice Genome Sequencing Project and Sasaki 2005). The parallel and complementary annotation efforts of the MSU RGAP and IRGSP/RAP are unified on the set of pseudomolecules that represent the 12 chromosomes, and therefore the gene loci, gene models and associated annotations are comparable. The findings from the latest unified Os-Nipponbare-Reference-IRGSP-1.0 pseudomolecules and MSU Rice Genome Annotation Project Release 7 are summarised in Table 3 (Kawahara et al. 2013). Continuous revision and revalidation of the minimal tiling paths were undertaken to update the sequence assembly. Advances in sequencing strategies, platforms and whole-genome sequencing allowed identification of sequencing errors in the revised assembly. Resequencing the genomes of two different Nipponbare individuals using the Illumina Genome Analyzer II/IIx platform identified a total of 4886 sequencing errors in 321 Mb of the assembled genome. This indicated that the error rate in the original IRGSP assembly was only 0.15 per 10,000 nucleotides. Five insertions/deletions were also identified using the Roche 454 pyrosequencing platform. These resequencing data, generated from two different individuals, gave information regarding allelic differences between the original Nipponbare individual used in the IRGSP and the two individuals used in this resequencing effort.
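The error rate quoted above follows directly from the two reported figures, as the short calculation below shows; the function name is purely illustrative.

```python
def errors_per_10kb(error_count, assessed_bases):
    """Sequencing errors per 10,000 nucleotides."""
    return error_count / assessed_bases * 10_000


observed_rate = errors_per_10kb(4886, 321_000_000)
print(f"{observed_rate:.2f} errors per 10,000 nucleotides")
print(observed_rate < 1.0)  # well within the standard of less than 1 error in 10,000 bases
```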

Table 3 Summary of MSU RGAP Release 7 (2011)

Sequence annotation efforts have designated ~400 distinct gene families in rice. In the Pfam database (http://pfam.xfam.org/), which harbours a large collection of protein families, 42,365 domains are reported for 48,930 protein sequences from the rice proteome, while 5891 unique domain organisations or architectures are also listed. The kinase domain (Pkinase) is maximally represented in the proteome, with currently 517 sequences containing this unique domain. This is in agreement with several instances of tandem repeats, already known for the rice genome.

4 Systems Biology Resources for Rice Genomics

The major challenge after completion of the ‘gold standard’ rice genome sequence was to organise all the available primary and secondary data from diverse experimental sources. INE, discussed previously, served as a unified platform for integrated map displays for each chromosome (Sakata et al. 2000). Advances in bioinformatics have since led to the development of species-specific or general databases, which allow efficient genome browsing, extension of the genetic information to multiple plant species and other organisms, as well as cross-referencing of data with functional capabilities. Apart from the various genome browsers with novel features that have been developed over the last decade, a plethora of new-generation analysis tools have also become available for functional studies, transcriptome analysis, gene coexpression analysis, protein-protein interaction networks, metabolic pathway analysis, orthology identification analysis and assessment of indexed rice mutant genes (Chandran and Jong 2014; Hong et al. 2019). All these tools have their own significance and greatly aid the functional interpretation of rice genes.

4.1 Genome Browsers

The concurrent release of genome sequences for Oryza sativa enabled generation of huge volumes of functional genomics data, made easy by high-throughput analysis tools. An integrated genome browser to visualise, navigate, analyse and annotate the rice genome therefore became indispensable for researchers and biologists. To serve this purpose of data organisation and visualisation, browsers such as Rice-Map (http://www.ricemap.org/), Rice Genome Annotation Project (RGAP, http://rice.plantbiology.msu.edu/), Rice Annotation Project Database (RAP-DB, https://rapdb.dna.affrc.go.jp/), Rice Functional Genomic Express Database (RiceGE, http://198.202.69.30/cgi-bin/RiceGE) and Gramene (http://gramene.org/) were eventually developed, each built on a different dataset (Wang et al. 2011; Kawahara et al. 2013; Sakai et al. 2013; Tello-Ruiz et al. 2018). Rice-Map delivers several precomputed japonica and indica subspecies annotations and also provides an interactive interface for users to browse the different genomic features at multiple levels (Wang et al. 2011). RGAP provides the fundamental rice genome data, supplemented with various analysis tools such as BLAST search, GO retrieval and domain and motif search (Kawahara et al. 2013). In contrast, Gramene allows for extension of data to other candidates within the plant kingdom, or subspecies within Oryza (Tello-Ruiz et al. 2018). These genome browsers allow bulk data retrieval, batch query searches and efficient identifier mapping between loci listed in discrete databases.

4.2 Orthology Identification Databases

Orthology identification databases have significantly augmented genome-wide analysis of groups of genes and comparative genomics research. For transferring the knowledge of economically important QTLs from one crop species to another, it is crucial that the orthologues are correctly identified. This knowledge can be exploited for development of markers in agronomically important crops. Such analysis can also expand the predicted protein-protein interaction networks: experimentally proven interactions between a pair or group of orthologous protein interactors can be extrapolated to other species, thus facilitating functional studies. Phytozome (https://phytozome.jgi.doe.gov/pz/portal.html) provides a centralised comparative hub for plant genomes, including those of land plants and algae (Goodstein et al. 2012). The evolutionary history of each plant gene can be browsed, starting from its sequence, gene structure, gene family and organisation within the genome, thus providing a platform to compare genes across plant species. GreenPhylDB (http://www.greenphyl.org/cgi-bin/index.cgi) is another web resource containing a comprehensive catalogue of gene families, automatically clustered and manually annotated into orthologous groups (Conte et al. 2008). It is supplemented with a powerful tool, GreenPhyl Ortholog Search Tool (GOST), which can predict orthologous relationships between protein sequences across a broad taxonomy of green plants.

4.3 Transcriptome Databases

The ultimate goal in rice molecular biology is to correctly assign a functional role to all the predicted genes in the rice genome. Despite major developments in high-throughput analysis tools, the number of fully characterised genes is far lower than the number of predicted genes, many of which are still classified as ‘hypothetical’, ‘unknown’ or ‘unannotated’. Large-scale gene expression profiling of different organs, tissues or cell types at various developmental stages greatly supports the functional characterisation of rice genes (Roychoudhury et al. 2011). The rapid accumulation of microarray data for rice in public repositories allowed hand-in-hand development of diverse informatics platforms and data analysis tools (Roychoudhury and Banerjee 2015). Rice Expression Profile Database (RiceXPro, http://ricexpro.dna.affrc.go.jp/) was built on the Agilent 44 K microarray platform, with probes derived from manually curated gene models in RAP-DB and full-length cDNA sequence information extracted from the Knowledge-based Oryza Molecular biological Encyclopedia (KOME, cdna01.dna.affrc.go.jp/cDNA/) database (Rice Full-Length cDNA Consortium 2003; Sato et al. 2011). The microarray analysis was carried out on organs/tissues of the rice plant at various growth stages under natural field conditions, rice seedlings treated with different phytohormones and specific cell types or tissues isolated by laser microdissection (LMD) techniques. The Rice Oligonucleotide Array Database (ROAD, http://www.ricearray.org/) integrates information from six rice microarray platforms, including the Affymetrix, Agilent 22 K and 44 K, BGI/Yale and the NSF 20 K and 45 K, thus making it very comprehensive, and provides a user-friendly web interface with various functional analysis tools (Cao et al. 2012). 
ROAD allowed meta-profile analysis for different anatomic tissues at various developmental stages, gene coexpression analysis and creation of coexpression networks and gene ontology (GO) and KEGG orthology (KO) analyses of query genes. With the emergence of next-generation sequencing technologies (NGS), RNA sequence-based profiling is rapidly gaining ground and is already replacing the databases built on microarray data, thus overcoming the limitations of microarray technique. Rice Expression Database (RED, http://expression.ic4r.org/) is a repository of gene expression profiles derived from curated and high-quality RNA-Seq data from tissues at various developmental stages or encompassing a wide variety of biotic and abiotic treatments (Xia et al. 2017). Similarly, AgriSeqDB (https://expression.latrobe.edu.au/agriseqdb) is an online RNA-Seq database with features for visualisation, analysis and interpretation of transcriptome data from various stages of development and tissue/cell types from several species, for major agricultural crops such as rice, wheat, maize, barley and tomato (Robinson et al. 2018). Transcriptome ENcyclopedia Of Rice (TENOR, http://tenor.dna.affrc.go.jp) provides comprehensive large-scale mRNA-Seq data obtained from rice sampled from a variety of conditions: ten abiotic stress conditions such as high salt stress, osmotic stress, high and low phosphate or cadmium levels, drought, cold and flood and two plant hormone treatments (abscisic acid and jasmonic acid) (Kawahara et al. 2016).

4.4 Promoter Databases

Promoter databases provide information regarding the core promoter structures and regulatory elements, which have been experimentally verified or predicted from consensus. Various informative resources for promoter detection and analysis are available for different plant species like rice, Arabidopsis and poplar. Plant Promoter Database (PPDB, http://ppdb.agr.gifu-u.ac.jp/ppdb/cgi-bin/index.cgi) is a web-based promoter database which comprises sequence lists of bioinformatically identified promoter elements, extracted by local distribution of short sequence analysis (LDSS) (Kusunoki and Yamamoto 2017). Promoter recognition is accomplished by annotating the genome sequence with these lists of TATA boxes, initiators, GA and CA elements, Y patches and regulatory element groups (REGs), supplemented with information on experimentally demonstrated transcription start sites (TSSs). Additionally, REGs are linked to the information in the Plant cis-acting regulatory DNA elements (PLACE, http://www.dna.affrc.go.jp/htdocs/PLACE/) database (Higo et al. 1998). PLACE is a database of nucleotide sequence motifs of regulatory elements extracted from published reports on genes in vascular plants and other plant species. It is appended with a Signal Scan programme that allows users to search for cis-regulatory elements in their query sequence. The outputs are assigned PLACE accession numbers and are hyperlinked to PubMed or GenBank identifiers. The Plant Promoter Analysis Navigator (PlantPAN, http://plantpan2.itps.ncku.edu.tw/) tool allows users to search for transcription factor binding sites (TFBSs), respective transcription factors (TFs) and several important cis- and trans-regulatory elements in their query promoter sequences or set of promoter sequences in plants (Chang et al. 2008). It also allows determination of the co-occurrence of TFs and their binding sites in the promoters of the input gene groups and construction of gene-regulatory networks using coexpression analysis. 
PlantPAN also provides cross species promoter search utilising paralogs and orthologues.
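The kind of motif search offered by such tools can be approximated by a simple consensus scan. The sketch below converts IUPAC nucleotide codes into a regular expression and reports every forward-strand match in a query sequence; it is not the Signal Scan programme or the PlantPAN algorithm, and the two motif entries and the query sequence are illustrative assumptions rather than database records.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def scan_motifs(sequence, motifs):
    """Return sorted (position, motif_name, matched_text) for all forward-strand matches."""
    sequence = sequence.upper()
    hits = []
    for name, consensus in motifs.items():
        # A lookahead is used so that overlapping occurrences are also reported.
        body = "".join(IUPAC[base] for base in consensus.upper())
        pattern = re.compile("(?=(" + body + "))")
        for match in pattern.finditer(sequence):
            hits.append((match.start(), name, match.group(1)))
    return sorted(hits)


# Illustrative consensus motifs and a hypothetical promoter fragment.
example_motifs = {"TATA-like": "TATAWA", "G-box-like": "CACGTG"}
example_promoter = "GGCACGTGCCTATAAATAGG"
print(scan_motifs(example_promoter, example_motifs))
```

A fuller implementation would also scan the reverse complement and attach database accession numbers to each hit.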

4.5 Databases for Rice Coexpression Analysis

Gene coexpression analysis is performed to associate genes of known or unknown function with biological pathways, by discerning the correlation patterns among them across microarray and other transcriptomic datasets. With recent advances in next-generation sequencing, gene coexpression networks built on RNA-Seq data help to refine the definitions of gene function and to associate noncoding genes and splice variants with biological roles. RiceFREND (http://ricefrend.dna.affrc.go.jp/) is a gene coexpression database built on large volumes of rice microarray data, derived from various tissues at different developmental stages under natural field conditions or under the influence of phytohormone treatments; this is the same Agilent 44 K array dataset that is available in the RiceXPro database (Sato et al. 2011, 2013). RiceFREND provides a platform with two modes of search, single guide gene and multiple guide genes, to find coexpressed genes or functionally related genes in various signalling or metabolic pathways. As mentioned previously, PlantPAN also has options for gene coexpression analysis and for construction of regulatory networks based on co-occurrence of TFs and protein-protein interactions (Chang et al. 2008). OryzaExpress is another integrated database, providing an interactive user interface to compare gene expression networks in rice with those of other plants (Arabidopsis Gene Expression Network data from ATTED-II) (Hamada et al. 2011).
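A single guide gene search of the kind described above can be sketched with a plain correlation calculation. The example uses the Pearson correlation coefficient as an assumed similarity measure, which is a common choice but not necessarily the one implemented in RiceFREND or the other databases; the gene names, expression values and threshold are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)


def coexpressed_with(guide_gene, profiles, threshold=0.9):
    """Rank genes whose profiles correlate with the guide gene at or above the threshold."""
    reference = profiles[guide_gene]
    scored = [
        (gene, pearson(reference, profile))
        for gene, profile in profiles.items()
        if gene != guide_gene
    ]
    return sorted((item for item in scored if item[1] >= threshold), key=lambda item: -item[1])


# Hypothetical expression values across five samples.
example_profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [1, 3, 2, 5, 4],
}
print(coexpressed_with("geneA", example_profiles))
```

A multiple guide gene search could be built on the same function by intersecting or ranking the lists returned for each guide gene.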

4.6 Phylogenomics Databases

Experimental determination of the biological functions of genes within large families is problematic because redundancy makes it difficult to assign the contribution of individual genes precisely. As of 2019, 1072 genes have been reported in the collection of the Overview of functionally characterized Genes in Rice Online database (OGRO, http://qtaro.abr.affrc.go.jp/ogro/) (Yamamoto et al. 2012). Therefore, a combinatorial approach that incorporates phylogenetics into comparative genomics, known as phylogenomics analysis, has been undertaken to predict the functions of candidates within large gene families on the basis of similarity among gene products. In rice, phylogenomics databases have been constructed for six gene families (http://ricephylogenomics.ucdavis.edu/index.shtml): kinases, glycosyltransferases (GTs), glycoside hydrolases (GHs), transcription factors (TFs), transporters and cytochrome P450 monooxygenases (P450s). The Rice Kinase Database (RKD) derives information from the NCBI Gene Expression Omnibus (GEO) and provides a platform to integrate functional genomics data into phylogenetic trees built for individual families (Dardick et al. 2007; Jung et al. 2010). The RKD includes an interactive chromosomal map depicting the position of kinase genes, protein-protein interaction maps and meta-expression data developed from microarray data analysis. Thus RKD facilitates effective estimation of functional redundancy or dominance among closely related kinases within subfamilies. The Rice GT database integrates functional genomics information on all putative rice glycosyltransferases on a phylogenetic tree (Cao et al. 2008). All these putative GTs have been identified through similarity searches against annotated GTs in the Carbohydrate-Active enZymes (CAZy) database (http://www.cazy.org/) and subsequently classified into subclasses based on domain composition and sequence similarities (Lombard et al. 2014). 
Similarly, the rice GH database lists all the GHs identified through sequence similarity searches in the CAZy database and shares a lot of features with the GT database (Sharma et al. 2013). It incorporates several features such as gene expression patterns, orthologous relationships, structural information and mutant availability for individual GH families in a phylogenomics context. These efforts have provided some significant insight into understanding of cell wall structure and biotic and abiotic stress tolerance. The Rice TF database (http://ricephylogenomics.ucdavis.edu/tf/) integrates information on putative and predicted rice TFs and transcriptional regulators, retrieved from the Plant Transcription Factor Database (PlnTFDB, http://plntfdb.bio.uni-potsdam.de/v3.0/) (Jin et al. 2017). Another database that provides comprehensive information regarding rice TF expression during drought and salinity stress and at various developmental stages is the Rice Stress-Responsive Transcription Factor Database (RiceSRTFDB, http://www.nipgr.res.in/RiceSRTFDB.html) (Priya and Jain 2013). This enables a holistic view of the TF of interest involved in stress response at a particular developmental stage, the cis-regulatory elements in the promoters, mutant availability and phenotype information. The Rice Transporter database (http://ricephylogenomics.ucdavis.edu/transporter/) contains all the potential rice membrane transporter genes, retrieved from the Transporter Protein Analysis Database (TransportDB; http://www.membranetransport.org/), while the Rice CytP450 Database (http://ricephylogenomics.ucdavis.edu/p450/) was developed to integrate functional genomics information for all putative rice P450s in a phylogenomics context (Elbourne et al. 2017). Both these databases are supplemented with similar features like orthologous relationships, gene expression patterns, phenotype information, etc., as discussed above for other databases.

4.7 Proteome Databases

Proteome databases in rice have been built on diverse platforms or data acquisition types, so each has its own significance. The Rice Proteome Database (RPD, http://gene64.dna.affrc.go.jp/RPD/main_en.html) contains a catalogue of rice proteins extracted from various tissues and subcellular compartments and separated and analysed by two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) (Komatsu 2005). The database contains an ensemble of reference maps based on 2D-PAGE, with sequences of individual proteins, as well as functional characterisation of major proteins. The Plant Proteomics Database (PPDB, http://ppdb.tc.cornell.edu/) contains an assembly of all protein-encoding gene models in Arabidopsis, maize and rice, which are linked to each other via BLAST options (Sun et al. 2009). Each gene model is associated with experimental information derived from in-house proteome and mass spectrometry analysis and is supplemented with curated information regarding protein function, properties and subcellular localisation. With advances in mass spectrometry, large-scale and quality-controlled peptide sequencing data are now available. The OryzaProteogenomics Database (OryzaPG-DB, http://oryzapg.iab.keio.ac.jp/) is a data repository of shotgun proteomics data, built from the outputs of 27 nanoLC-MS/MS runs, on a hybrid ion trap-orbitrap mass spectrometer, of tryptic digests from undifferentiated cultured rice cells (Helmy et al. 2011). Peptides were identified by searching the product ion spectra against protein, cDNA, transcript and genome databases and were subsequently mapped onto the rice genome. OryzaPG-DB is the first proteogenomics-based database of the rice proteome, associating peptide-based expression profiles with their corresponding genomic origin.

4.8 Protein-Protein Interaction Databases

These databases have flourished owing to recent developments in the techniques used to identify potential interactors of a candidate protein. Protein interactome data from yeast two-hybrid (Y2H), split-ubiquitin (sUbq), bimolecular fluorescence complementation (BiFC), fluorescence resonance energy transfer (FRET), affinity purification (AP), co-immunoprecipitation (Co-IP) and mass spectrometry (MS) have reignited the possibilities of comparative analysis of protein-protein interactions (PPIs). The underlying assumption is that protein complexes across different species share significant structural and functional similarity, so evolutionarily conserved proteins will retain not only their structure and function but also their interactions with other protein partners; such conserved interactions are termed interologs. Predicted Rice Interactome Network (PRIN, http://bis.zju.edu.cn/prin/), a well-annotated PPI database for rice, extended the current expanse of the PPI data by integrating information from experimentally verified interologs of six model organisms (Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fruit fly), Homo sapiens (human), Escherichia coli K12 and Arabidopsis thaliana), using computational approaches (Gu et al. 2011). Supplemented with GO assignments, subcellular localisation data and gene expression data, PRIN provides a user-friendly web interface for easy database search and effective network visualisation. STRING (Search Tool for the Retrieval of Interacting Genes/Proteins, https://string-db.org/) is another database of established and predicted protein-protein interactions, which may be direct (physical) or indirect (functional) in nature (Szklarczyk et al. 2015). Information on interactions in the STRING database stems from genomic context predictions, high-throughput lab experiments, coexpression data, automated text-mining, identification of interologs in other organisms and interactions reported in other primary databases.
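
The interolog principle described above can be sketched in a few lines: an interaction known in a model organism is transferred to rice whenever both partners have rice orthologues. This is only an illustration of the idea, not the PRIN pipeline; the protein identifiers and the function name predict_interologs are hypothetical.

```python
def predict_interologs(model_ppis, orthologs):
    """Transfer model-organism interactions to rice through an orthologue map.

    model_ppis: iterable of (protein_a, protein_b) pairs from a model organism.
    orthologs: dict mapping a model-organism protein to a set of rice proteins.
    """
    predicted = set()
    for a, b in model_ppis:
        for rice_a in orthologs.get(a, ()):
            for rice_b in orthologs.get(b, ()):
                # Store each pair in sorted order so A-B and B-A count once.
                predicted.add(tuple(sorted((rice_a, rice_b))))
    return predicted

# Hypothetical identifiers, for illustration only.
model_ppis = [("AtP1", "AtP2"), ("AtP2", "AtP3")]
orthologs = {"AtP1": {"OsX1"}, "AtP2": {"OsX2", "OsX3"}}
predicted = predict_interologs(model_ppis, orthologs)
print(sorted(predicted))
```

The second model interaction yields no prediction because AtP3 has no rice orthologue in the map, which is the main limit on coverage for this kind of inference.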

4.9 Databases for Metabolome Analysis

Metabolomics addresses the global metabolic changes in biological systems and has immense application in disease diagnosis and functional genomics. Metabolic profiling of cells, tissues or organisms, biomarker identification and pathway analysis are an integral part of it and demand easy-to-use analysis tools. MetaboAnalyst (https://www.metaboanalyst.ca/) provides a web-based user interface with metabolomics data processing tools, options for data normalisation, statistical analysis, graphing, metabolite identification and characterisation as well as pathway mapping (Chong et al. 2018). MetPA (Metabolomics Pathway Analysis, https://www.metaboanalyst.ca/) is a web-based tool, which combines advanced pathway enrichment analysis and pathway topology analysis, to visualise and analyse metabolic pathways involved in a particular metabolomics study (Xia and Wishart 2010). MetPA currently allows visualisation and analysis of a total of 1600 pathways for 21 model organisms, including rice. EXPath (http://expath.itps.ncku.edu.tw) uses public large-scale microarray datasets, derived from samples under abiotic or biotic stress or under different hormone treatments, for gene coexpression analysis and identification of differentially expressed genes (DEGs) (Chien et al. 2015). Down the pipeline, this information is finally utilised for inferring enriched KEGG pathways and GO terms for three plant species, namely, Arabidopsis, rice and maize.
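
Pathway enrichment of the kind mentioned above is often assessed with an over-representation test. The sketch below uses a one-sided hypergeometric test as an assumed, commonly used choice; it is not claimed to reproduce the statistics of any tool named here, and the counts are invented for illustration.

```python
from math import comb

def enrichment_p(total, in_pathway, selected, hits):
    """P(X >= hits) when `selected` items are drawn without replacement from
    `total` items, of which `in_pathway` belong to the pathway."""
    upper = min(in_pathway, selected)
    favourable = sum(
        comb(in_pathway, k) * comb(total - in_pathway, selected - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(total, selected)

# Invented counts: 20 measured compounds, 5 in the pathway,
# 4 significantly changed, 3 of which fall in the pathway.
p = enrichment_p(total=20, in_pathway=5, selected=4, hits=3)
print(round(p, 4))
```

A small p-value suggests that the pathway contains more changed compounds than expected by chance; in practice the values are corrected for the number of pathways tested.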

4.10 Rice Gene Indexed Mutant Databases

The rice research community has greatly benefited from the high quality of the rice genome and also the appreciably fair annotation in other cereal crop species. To elucidate the function of all the predicted coding and noncoding regions within the rice genome, the International Rice Research Institute (IRRI) in close association with the International Rice Functional Genomics Consortium (IRFGC) developed an enormous collection of indexed rice mutant genes. The members within a population of gene indexed mutants are distinguished on the following criteria: mutagenesis methods adopted (T-DNA insertion, transposons, chemical or physical mutagens), rice variety, mutant phenotypes, seed availability, reporter-gene expression patterns and mutated loci per genome. All this information facilitated further functional characterisation of the genome. Rice Mutant Database (RMD, http://rmd.ncpgr.cn/) encompasses the information regarding ~129,000 rice T-DNA insertion (enhancer trap) lines generated by an enhancer trap system and delivers three distinct functional classes: (1) novel gene identification, (2) identification of regulatory elements and (3) identification of patterns of ectopic expression of target gene at particular growth stage or tissue type (Zhang et al. 2006). Oryza Tag Line (http://oryzatagline.cirad.fr/) is a phenotypic mutant database for the French genomics initiative ‘Genoplante’ rice insertion line library (Larmande et al. 2008). Based on the molecular characterisation of the mutagen insertion sites, the sequence information of flanking sequence tags (FSTs) was retrieved. This database also allows forward genetic search through queries based on mutant phenotype or reporter-gene expression, coupled with other categories such as organ, developmental stage or trait. 
OryGenesDB (http://orygenesdb.cirad.fr/data.html) displays sequence information on T-DNA and Ds FSTs in insertional rice mutants, produced in the framework of Genoplante and the EU consortium Cereal Gene Tags (Droc et al. 2006). This database is supplemented with tools for reverse genetics, allowing a molecular geneticist to readily find T-DNA, Tos17 and/or Ds insertion lines in genes of interest and to retrieve all the annotations associated with these sequences, derived from external rice molecular resources (e.g. cDNA full length, gene, EST, markers, expression data, etc.).
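
A reverse genetics query of this type reduces to asking which insertion sites fall within the coordinates of a gene of interest. The sketch below assumes that each flanking sequence tag has already been mapped to a single genomic position; the record layout, identifiers and the function name lines_hitting_gene are hypothetical and do not reflect the schema of any database described here.

```python
def lines_hitting_gene(gene, insertions):
    """Return IDs of insertion lines whose mapped position lies inside the gene.

    gene: (chromosome, start, end) with 1-based inclusive coordinates.
    insertions: iterable of (line_id, chromosome, position).
    """
    chrom, start, end = gene
    return [
        line_id
        for line_id, ins_chrom, pos in insertions
        if ins_chrom == chrom and start <= pos <= end
    ]

# Hypothetical gene and insertion records, for illustration only.
gene = ("Chr1", 1000, 2500)
insertions = [
    ("lineA", "Chr1", 1500),
    ("lineB", "Chr1", 3000),
    ("lineC", "Chr2", 1500),
    ("lineD", "Chr1", 2500),
]
hits = lines_hitting_gene(gene, insertions)
print(hits)
```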

4.11 Other Databases

Apart from the categories of databases listed above, other web resources are available, each built on diverse data sources and possessing unique features and tools suited to functional genomic annotation. RiceVarMap (http://ricevarmap.ncpgr.cn/v2/) provides curated information on genomic variations, namely single nucleotide polymorphisms (SNPs) and small insertions/deletions (INDELs), from sequencing data of ~4700 rice accessions (Zhao et al. 2015). The variations are associated with comprehensive resources of functional annotations, chromatin accessibility data, risks associated with gene expression, phenotype data and images, agronomic and metabolic traits, geographical details, etc. DroughtDB (http://pgsb.helmholtz-muenchen.de/droughtdb) is a valuable tool for researchers working on drought stress (Alter et al. 2015). This database is a manually and expert-curated compilation of drought stress-responsive genes in plants that have been molecularly characterised. Each drought stress-responsive gene, already characterised in either Arabidopsis or Oryza sativa, is supplemented with information on its sequence, physiological or molecular function, mutant phenotypes and homologs in nine model crop species, like maize, barley, etc. With new insights into the role of microRNAs in plant growth and development, repositories of microRNA data are also being developed. The plant microRNA database (PMRD, http://bioinformatics.cau.edu.cn/PMRD/) gathers available information on plant microRNAs, for example, sequences of microRNAs and their target genes, secondary structure, expression profiling, genomic origins, etc. (Zhang et al. 2010). Gene duplications are widespread in plant species. For example, receptor-like kinase families in Arabidopsis and rice have nearly 600 and 1000 members, respectively. This implies that gene duplication and subsequent mutations have generated new genes with diverse functions. 
Along similar lines, recent developments in gene coexpression networks suggest that not only genes but also pathways can undergo multiplication and diversification to perform related functions in various parts of an organism. Such multiplied pathways are called modules, and they may help us to understand biological processes and functions better. FamNet (http://aranet.mpimp-golm.mpg.de/famnet.html) is an interactive platform for exploration and visualisation of these multiplied modules in gene coexpression networks of eight plant species (Ruprecht et al. 2016). With this tool, multiplied modules involved in tip growth in pollen tubes and root hairs or in secondary metabolite synthesis were identified and functionally characterised in separate plant species.

Recent advances in the development and refinement of new and existing high-throughput analysis tools have significantly impacted the functional annotation of the rice genome. From browsing the genome to deciphering phylogenomic relationships, from mapping orthologues to identifying SNPs and INDELs, from identification of gene coexpression networks to metabolic pathways, these tools have improved the quality of meaningful predictions. Table 4 lists the diverse omics tools (browsers, databases, mapping tools, etc.) that have not been touched upon in the above section but are widely used for functional genomics.

Table 4 List of genome browsers, databases and diverse omics tools which facilitate high-throughput analysis of the rice genome

5 Rice Genomics Propelled Research on Important Agronomic Traits

The availability of the rice genome sequence data and the development of new tool sets to analyse the data have revolutionised research in rice genetics and breeding. A search of the NCBI PubMed database with the keywords ‘map-based cloning’, ‘rice’ and ‘Oryza’ retrieved over 4700 publications on rice research from the past three decades (see Fig. 3). Following the establishment of the Rice Genome Research Program (RGP) in Japan, there was a major boost in rice research, with 40–60 publications per year, nearly 2–3 times the publication volume before RGP was established. Following the public release of the rice genome sequence by IRGSP in 2005, which incorporated the draft sequences from Monsanto and Syngenta, there were, on average, more than 200 publications per year pertaining to rice research. It is evident from the volume of publications in the last 10 years that the numbers are on the rise, propelled by the sequencing of the genomes of several wild and domesticated rice varieties and other cereals. This rough analysis, which does not involve any manual curation of the retrieved citations, simply indicates the enormous advancement in rice research enabled by the completed rice genome sequence, with a multitude of genes being associated with metabolic, developmental or stress-related signalling pathways (Wing et al. 2018; International Rice Genome Sequencing Project and Sasaki 2005). Moreover, the reference genome of rice allowed exploration of genomes across the entire Oryza genus as well as the study of genetic variation among domesticated rice species and their wild relatives (Wing et al. 2018; Civáň et al. 2015; Wang et al. 2014; Meyer et al. 2016). Sequence-based analysis of the variations in different species allowed breeders to exploit such variations for rice improvement. 
None of this would have been possible without the significant increase in the number of molecular markers and access to knowledge about their physical order on chromosomes and proximity to annotated genes, all of which was utilised to predict gene-trait associations. Another noticeable impact of the rice genome sequence was the opportunity to understand the molecular and genetic basis of traits such as efficient mineral utilisation, resistance to biotic and abiotic stress and physical features. All this insight helped rice researchers to engineer ‘Green Super Rice’, a new generation of sustainable crops intended to meet the food demands of the growing population.
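
The per-year publication counts discussed above could be gathered programmatically through the NCBI E-utilities esearch endpoint. The sketch below only builds the query URLs and does not contact the server. How the three keywords were combined in the original search is not stated, so joining them with AND is an assumption, as is the helper name yearly_count_url.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def yearly_count_url(year, terms=("map-based cloning", "rice", "Oryza")):
    """Build an esearch URL that asks PubMed for the record count of one year.

    Assumption: the keywords are combined with AND; the chapter does not
    state the exact query that was used.
    """
    query = " AND ".join(f'"{term}"' for term in terms) + f" AND {year}[pdat]"
    params = {"db": "pubmed", "term": query, "rettype": "count"}
    return f"{ESEARCH}?{urlencode(params)}"

# One URL per year of the 1990-2019 window.
urls = {year: yearly_count_url(year) for year in range(1990, 2020)}
print(urls[2005])
```

Fetching each URL returns a small XML document with the count for that year; different keyword combinations or field tags will give different totals from those reported here.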

Fig. 3

A timeline for the period 1990–2019 depicts 4783 publications related to research on gene cloning in rice, retrieved from the NCBI PubMed database. For the year 2019, the records listed run up to the month of March. The dotted arrows indicate the years when RGP and IRGSP were established and when the map-based genome sequence of rice was released

Limitations in arable land and water resources in rice-producing countries, together with the pressures of global population growth, can be addressed to a great extent by increasing the per unit area yield of rice. This has become largely possible in the last decade through the correct identification of molecular markers for agronomically desirable traits and their effective use in rice production.

5.1 Improving Plant Architecture and Yield

The semidwarf trait has been introduced into a number of rice cultivars grown worldwide, as it improves light-interception properties, harvest index, nitrogen responsiveness and lodging resistance without having detrimental effects on grain yield. The International Rice Research Institute developed the semidwarf rice variety IR8 from a cross between Peta, a tall indica variety, and Dee-geo-woo-gen (DGWG), a Taiwanese indigenous semidwarf rice variety. This new variety led to a dramatic improvement in rice yields, thereby bringing the Green Revolution to tropical Asia (Hargrove and Cabanilla 1979). IR8 had a culm length of 90–100 cm, whereas its parent line, Peta, is characterised by a longer culm (150–180 cm) and long hanging leaves. Sequence data from the rice genome and supporting mapping studies facilitated identification of the gene responsible for the semidwarf trait in rice (Monna et al. 2002; Spielmeyer et al. 2002). The incompletely recessive gene sd1, located on the long arm of chromosome 1, encodes a GA20-oxidase (OsGA20ox2), and the semidwarf alleles carry loss-of-function mutations. This gene product catalyses a step in the gibberellin (GA) biosynthesis pathway. Semidwarfism is one of the traits that was positively selected during domestication of japonica rice. In addition to affecting plant height, it has a positive effect on tiller number and promotes a more erect leaf blade.

Apart from plant architecture, other component traits that determine the yield of a rice plant are tiller number, number of grains per panicle, grain weight and grain filling rate. MONOCULM 1 (MOC1) was the first gene identified as controlling rice tiller number (Li et al. 2003). MOC1 is a member of the plant-specific GRAS transcription factor family and is required for growth of axillary meristems at both vegetative and reproductive stages. Consequently, loss of function of MOC1 affects formation of both tillers and panicles. Ghd7 encodes a CCT domain-containing transcriptional activator that simultaneously controls number of grains per panicle, plant height and heading date (Xue et al. 2008). Another QTL with similar effects is Ghd8, which encodes the OsHAP3 subunit of a heterotrimeric heme activator protein (HAP) (Yan et al. 2011). This gene modulates flowering time in rice and upregulates MOC1, thereby increasing the number of tillers and grain yield. Along with Ghd7.1, which encodes a PSEUDO-RESPONSE REGULATOR (PRR), Ghd7 and Ghd8 exert pleiotropic effects (delayed heading and increased plant height and grain yield) under long-day conditions. Map-based cloning facilitated identification of the Hd1 gene, which strongly affects the interaction between Ghd7 and Ghd8 (Zhang et al. 2015). Recently it has been demonstrated that combinations of loss-of-function alleles of Ghd7, Ghd8 and Hd1 allow expansion of rice cultivars to higher altitudes, therefore defining the ecogeographical adaptation and yield potential of rice cultivars. High-yielding varieties can also be associated with high grain numbers per ear. Several genes, such as the DENSE AND ERECT PANICLE genes DEP1, DEP2 and DEP3, have been identified that affect this trait (Xu et al. 2016). The Gn1a gene, encoding the cytokinin oxidase OsCKX2, was identified by map-based cloning. 
Downregulation of Gn1a expression leads to an increase in the number of glumous flowers and thereby in grain number per ear (Yeh et al. 2015). IPA1 (Ideal Plant Architecture 1) encodes the transcription factor OsSPL14, which regulates plant architecture through DEP1 (Zhang et al. 2017). At optimal expression levels of IPA1, fewer but more productive tillers are produced, and plant height and ear length are also regulated.

5.2 Improving Grain Quality

The demand for enhanced grain quality is driven primarily by consumers and is therefore important to farmers and millers. The quality of rice grain depends on a number of features, which vary according to the preferences of consumers across the world. Grain size, shape and translucent appearance are important features determining grain quality and are directly related to yield. A set of genes has been shown to control these traits. GS3, GW2 and GW5 are negatively correlated with grain length and width, grain weight and grain fullness (Zheng et al. 2015). Upregulation of GS5 expression is associated with seed yield, size, thousand seed weight and seed setting rate as well as grain width and weight. GW8 expression promotes cell division and regulates grain size, thousand seed weight and filling speed, thereby increasing yield and affecting milling quality. Milling quality is largely determined by the chalkiness and intactness of the milled rice. Grain chalkiness is highly undesirable, and a major QTL, Chalk5, influences this trait. Elevated expression of Chalk5 increases chalkiness of the endosperm, thereby posing a major problem in milling and post-processing (Li et al. 2014). The combination of amylose content and gelatinisation temperature influences the cooking quality and palatability of rice. The WAXY gene codes for the starch synthase enzyme that is responsible for synthesis of amylose in the endosperm. Based on the amylose content (AC), rice is commercially classified into five categories: high (25–33%), intermediate (20–25%), low (12–20%), very low (5–12%) and waxy (0–5%) (Juliano 1992; Yamanaka et al. 2004). Grains with high AC give dry and well-separated rice after cooking, which eventually turns hard upon cooling. Rice varieties with intermediate AC cook to a soft, nonsticky texture, while low and very low AC results in a soft and sticky texture. 
Waxy rice (also called sweet rice or glutinous rice) becomes sticky when cooked. Using a map-based cloning strategy, ALK was identified as the key gene controlling gelatinisation temperature and therefore the gel consistency in rice (Gao et al. 2003). Although rice is a major carbohydrate source and even a potential protein source for rice-eating populations, it is a poor source of essential micronutrients. For countries where rice is the staple food, a rice-based diet is the primary cause of micronutrient malnutrition, including iron, zinc and vitamin A deficiencies. These deficiencies are associated with reduced working capacity, decreased mental capacity, blindness, stunting and elevated morbidity and mortality. To address this issue, food-based approaches have been adopted that involve biofortification of genotypes to increase their levels of vitamins and minerals. Golden Rice, expressing two genes encoding phytoene synthase and carotene desaturase, produces the provitamin A carotenoid β-carotene in the rice endosperm and is one of the earliest examples of biofortification (Schaub et al. 2005). Increasing the seed iron content by overexpression of the iron storage protein ferritin in rice grains and controlling the mineral status by overexpressing ion transporters are other strategies adopted. Ongoing efforts aim at enriching the rice grain with other nutrients, specifically high-quality protein and vitamin E. Other approaches include screening germplasm collections for variation in mineral content (Gregorio et al. 2000) and initiating rice breeding programmes to develop mineral-rich genotypes, using high iron/zinc germplasms and crossing them with plants containing other micronutrient traits (e.g. vitamin E, high protein, β-carotene). 
A different approach of rice biofortification can be adopted to regulate/reduce the levels of compounds such as phytate and tannins, both of which complex with minerals and prevent their absorption by the body during digestion. This approach simply aims to enhance mineral bioavailability (Welch and Graham 2004).
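
The amylose content (AC) classes listed earlier in this section can be expressed as a small lookup. Because the published ranges share their end points (for example, 12% closes one class and opens the next), the sketch below assumes that each boundary value belongs to the higher class; that choice and the function name amylose_class are illustrative assumptions.

```python
def amylose_class(ac_percent):
    """Map an amylose content in percent to the commercial class named in the text.

    Assumption: a value on a shared boundary is placed in the higher class.
    """
    if not 0 <= ac_percent <= 33:
        raise ValueError("amylose content outside the 0-33% range covered by the classes")
    if ac_percent < 5:
        return "waxy"
    if ac_percent < 12:
        return "very low"
    if ac_percent < 20:
        return "low"
    if ac_percent < 25:
        return "intermediate"
    return "high"

for value in (3, 10, 22, 28):
    print(value, amylose_class(value))
```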

5.3 Enhanced Efficiency of Nutrient Use

Reducing dependency on fertiliser application can be achieved by gaining a deep understanding of the molecular mechanisms underlying nitrogen and phosphorus usage. Overexpression of OsNRT1.1A (OsNPF6.3), a member of the Oryza sativa nitrate transporter 1/peptide transporter family, in rice significantly improved nitrogen utilisation and grain yield; additionally, the maturation time was also shortened (Wang et al. 2018b). Overexpression of OsNPF7.2, a low-affinity nitrate transporter, significantly enhanced nitrate influx and promoted tillering, thereby improving grain yield (Wang et al. 2018a). Likewise, overexpression of OsNRT2.1, a high-affinity nitrate transporter, increases yield and manganese accumulation, while high expression of OsNRT2.3b, a pH-sensitive nitrate transporter, enhances the pH-buffering capacity of the plant, promoting plant adaptation, and also facilitates increased uptake of nitrogen, iron and phosphorus (Fan et al. 2016; Luo et al. 2018). The phosphate transporter gene OsPht1;8 (OsPT8) increases Pi uptake and is involved in phosphate homeostasis in rice (Jia et al. 2011). Members of the sulphate transporter family in rice (OsSULTR) play an important role in regulating sulphur demand by the plant (Sasaki et al. 2016). OsHAK5, a high-affinity K transporter, controls potassium acquisition and distribution in the rice plant (Chen et al. 2017). Therefore, identification and manipulation of potentially useful genes have to be done with an aim to maximise uptake, allocation and utilisation.

5.4 Increased Resistance to Abiotic and Biotic Stress

Developing rice varieties that can withstand frequent and harsh changes in climatic conditions is an important goal of rice research worldwide. The relevant abiotic stresses include drought, flood, temperature extremes (heat and cold), salinity and diverse soil problems such as iron toxicity. Tremendous efforts are being made to identify genes or genomic regions responsible for conferring resistance or susceptibility to these conditions (Roychoudhury and Paul 2012; Golldack et al. 2014). Breeders have adopted marker-assisted breeding to incorporate specific desirable traits into novel varieties accurately and at a faster rate. Through the use of advanced technology and contributions from national institutions, the International Rice Research Institute has developed ‘climate change-ready rice’ varieties that are able to tolerate harsh climatic conditions (https://www.irri.org/climate-change-ready-rice). Several QTLs have been identified and are being extensively studied to gain better insight into the underlying molecular and physiological processes. OsNAC10 and OsAHL1 confer drought tolerance and drought avoidance by regulating root development under stress conditions (Jeong et al. 2010; Zhou et al. 2016). The SUB1A gene, derived from an Indian rice variety, was found to help the plant conserve energy until the floodwater recedes, thereby conferring tolerance to submergence. Two ethylene response factors, SNORKEL1 and SNORKEL2, enabled adaptation of rice to deep water (Hattori et al. 2009). QTLs such as Ctb1 and COLD1 were associated with chilling tolerance, while TT1 and OsRab7 are linked to thermotolerance (Gardener and Kumar 2015). A major region of the rice genome that confers tolerance to salinity, named Saltol, has been identified. Saltol is being widely exploited to develop crop varieties that can tolerate exposure to salinity stress at both seedling and reproductive stages (Thomson et al. 2010). 
This approach, combined with submergence tolerance trait, is targeted to rice varieties cultivated in coastal areas, where during rainy season, salinity and submergence are major problems. Additionally several QTLs have been identified from genetic maps that could be associated with iron or aluminium toxicity tolerance and mineral homeostasis. Several of these genes have also been integral parts of breeding programmes of climate change-ready rice.

The increase in the number of cloned disease resistance genes has been greatly accelerated by the map-based sequence of the rice genome. A molecular and functional understanding of these genes gave insights into devising strategies to prevent yield loss during bacterial blight and fungal blast infections, whose causal pathogens are Xanthomonas oryzae pv. oryzae and Magnaporthe oryzae, respectively. Additionally many genes have been fine-mapped, paving the path to identify functional markers. Marker-assisted selection was employed to introgress such genes into parent varieties to develop elite hybrids. The R gene in rice Xa3 (also known as Xa26) displayed enhanced resistance level and enlarged resistance spectrum (i.e. resistance to multiple Xoo races) when expressed under native promoter in japonica rice (Cao et al. 2007). The atypical protein with armadillo repeats, encoded by the Ptr gene, confers broad-spectrum blast resistance (Zhao et al. 2018). Involvement in conferring resistance to bacterial and fungal pathogens was also demonstrated for OsGAP1 encoding a GTPase-activating protein (GAP), receptor-like kinases like OsBRR1 and XA21 and transcription factors like WRKY45 (Singh et al. 2018). Although marker-assisted selection provides an effective strategy for pyramiding of disease resistance genes, it can be combined with the transgenic approach using cloned genes for developing disease-resistant rice cultivars.

6 Conclusion and Future Perspectives

6.1 Rice Genome Sequence: A Milestone in an Expanse of Untrodden Information

Access to the wealth of genomic sequence information for a cereal crop plant was a revolutionary achievement of the twenty-first century, largely because rice is a quintessential component of the diet and livelihoods of a large part of the world's population. There is a constant demand to improve the yield, productivity, health and acceptability of the plant, which has largely been addressed across generations through traditional crop-breeding programmes. The nature of these demands has changed noticeably in the present global scenario, given unchecked population growth, climate change, loss of cultivable land, greenhouse gas emissions and excessive use of pesticides and fertilisers. The available rice genome sequence has enabled map-based cloning of QTLs, marker-based characterisation of beneficial genes, functional analysis through knockouts generated by mutation or gene tagging and exploration of the introgression of these genes or genomic regions to develop new agronomically competent hybrids. Moreover, it has facilitated a better understanding of the successful ecogeographical adaptation of rice, the genetic variation between wild and domesticated species and the molecular components underlying agronomically beneficial phenotypes.

The opportunity to map genes to the genome provides useful information on the micro- and macro-level synteny between the rice genome and those of other cereal crops. Such comprehensive molecular linkage maps are the primary components of the toolkit required to breed improved crops. Following functional assignment of genes, a large amount of information can be extended to other plant species, simply by virtue of the extensive colinearity in gene order and content among members of the dicot and monocot families. Comparative mapping of grass genomes, using rice as the reference genome, revealed considerable macro-colinearity, represented by a graphical consensus map, the ‘Circle Diagram’ (Moore et al. 1995). Further refinements of such maps gave the general impression that, despite significant variations in DNA content or chromosome number, all the cereal crops examined tend to maintain similar gene order, thus maintaining significant micro-colinearity at the mega-base level. Deletions, insertions, small-scale rearrangements or even gene amplification, duplication and translocations can account for the deviations observed. This extensive genome colinearity is exploited for fine mapping and map-based cloning: several crop plants have large genomes, and the extra information derived from the closely related model crop, rice, greatly facilitates the cloning of genes in other crops.

Recently, an international effort to resequence 3000 rice germplasm accessions on the Illumina-based next-generation sequencing (NGS) platform was undertaken to understand the total genetic diversity within the Oryza sativa gene pool (3000 Rice Genomes Project 2014). The International Rice Research Institute (IRRI) contributed 2466 accessions from its International Rice Genebank Collection (IRGC), while the remaining 534 accessions were obtained from the China National Crop Gene Bank (CNCGB) in the Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS). The temperate O. sativa ssp. japonica Nipponbare (Os-Nipponbare-Reference-IRGSP-1.0) was used as the reference genome for mapping of the selected reads, which underscores the significance, versatility, expanse and quality of this genome sequence. When the reads were aligned with the reference genome, a very large number of single nucleotide polymorphisms (SNPs) could be mapped, on the basis of which the O. sativa gene pool could be differentiated into five varietal groups of diverse origins: indica, aus/boro, basmati/sadri, tropical japonica and temperate japonica. This effort not only gave access to enormous volumes of SNP data but also provided a knowledge-based tool to discover allelic variants and associate important rice traits with diverse alleles. Additionally, a number of high-quality reference genomes became accessible for rice genome research.
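To illustrate how SNP calls against a common reference can separate accessions into groups, the following is a minimal sketch and not the method used by the 3000 Rice Genomes Project. It assumes genotypes coded as 0, 1 or 2 copies of the non-reference allele, uses a simple allele-sharing distance, and works on invented toy data; the accession names and the helper functions `snp_distance` and `nearest_neighbour` are hypothetical.

```python
# Hypothetical toy genotypes: one list per accession, one entry per SNP site.
# 0 = homozygous reference allele, 1 = heterozygous,
# 2 = homozygous alternate allele, None = missing call.
genotypes = {
    "acc_A": [0, 0, 2, 2, 0, None],
    "acc_B": [0, 0, 2, 2, 1, 0],
    "acc_C": [2, 2, 0, 0, 2, 2],
    "acc_D": [2, 2, 0, 1, 2, 2],
}


def snp_distance(g1, g2):
    """Allele-sharing distance in [0, 1] over sites called in both accessions."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no SNP sites called in both accessions")
    # Each site contributes at most 2 allele differences, hence the factor 2.
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))


def nearest_neighbour(name, genos):
    """Return the other accession with the smallest distance to `name`."""
    others = [other for other in genos if other != name]
    return min(others, key=lambda other: snp_distance(genos[name], genos[other]))


if __name__ == "__main__":
    for name in sorted(genotypes):
        print(name, "is closest to", nearest_neighbour(name, genotypes))
```

In this toy set, acc_A pairs with acc_B and acc_C pairs with acc_D, mimicking two divergent groups. Real analyses work on millions of SNPs and typically rely on dedicated population-structure or phylogenetic software rather than a nearest-neighbour rule.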

Another revolutionary development that took place alongside was the emergence of diverse web-based omics tools. Genome browsers and databases were developed and have gradually been equipped with high-throughput analysis tools for comparative and functional genomics, transcriptomics, metabolomics, phylogenomics, interactomics, etc. Integration of information from different experiments (microarray, RNA-Seq, promoter analysis, yeast two-hybrid, mass spectrometry, co-immunoprecipitation, bimolecular fluorescence complementation and fluorescent protein localisation, to name a few) has now enabled researchers to understand the temporal and spatial expression patterns of a target gene, identify other genes that are coexpressed with the gene of interest, predict function based on sequence and structural similarities, predict potential interactors and, thereby, possibly identify the signalling or metabolic pathway in which the candidate gene is involved. In fact, several computational systems biology studies have been undertaken to understand the mechanisms of salt tolerance or the regulatory crosstalk in the hormone biosynthesis pathways that modulate responses to biotic and abiotic stress (Wang et al. 2013; Deb et al. 2016). This advancement in the diversity, user accessibility and versatility of omics-driven tools, software and programmes is gradually clearing up the ambiguity associated with the rice genome.

However, this information explosion remains of limited value until researchers are able to associate functional, structural or phenotypic implications with the corresponding genes. Despite the large number of genes predicted from the rice whole-genome sequence, only ~3100 gene loci have been analysed to date for biological function using various approaches (http://funricegenes.ncpgr.cn/). This disparity between the analysed genes and the total number of genes is due to the restrictions imposed by functional redundancy between genes and the limited data available on gene function. Additionally, several QTLs have been associated with different traits, but very few have been fine-mapped, identified or even explored for their functions. At this stage of rice research, where there is a need to deliver a sustainable crop plant, more initiatives have to be undertaken to understand the functionality of the genome in its entirety, to delineate the effect of allelic differences on gene function, to dissect the molecular basis of the ecogeographical adaptation of rice varieties and to explore the feasibility of manipulating these traits to our benefit.

The development of new, improved varieties of cereal crops was a significant achievement of the Green Revolution, made possible through the efforts of Norman Borlaug, widely regarded as its father. Nearly half a century ago, the International Rice Research Institute (IRRI) developed the IR8 variety of rice, also known as ‘miracle rice’ (Hargrove and Cabanilla 1979). This variety yielded more grain than the traditional ones when grown with proper irrigation and nitrogen-rich fertilisers. But this miracle came at a high price: excessive and indiscriminate use of chemical fertilisers and pesticides affected the environment, commercial fertilisers posed an additional cost to rice farmers, and irrigation demands adversely affected the global water supply, which also faces additional pressure due to climate change, pollution, population growth, industrial requirements, etc. The demand for an environment-friendly solution was therefore addressed in a collaborative project between IRRI and the Chinese Academy of Agricultural Sciences (CAAS). Funded by the Bill & Melinda Gates Foundation (BMGF), the project aims to develop Green Super Rice (GSR) varieties that can ensure a stable and sustainable yield potential even when supplied with fewer inputs or grown in unfavourable environmental conditions. In contrast to the breeding approaches undertaken during the Green Revolution, where all other traits took a backseat to high yield, the GSR researchers adopted a different methodology. A large number of backcrossed second-generation lines (BC2F2) and their succeeding generations were subjected to a range of biotic and abiotic stresses, after which weak lines could be eliminated and promising transgressive high-performing segregants could be identified.
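As a rough orientation for the BC2F2 lines described above, the expected genome composition of backcross-derived material can be written down directly. This is the standard textbook expectation under the assumptions of no selection and no linkage drag, not a figure reported by the GSR project; the symbols n (number of backcrosses), D_n (expected donor genome fraction) and R_n (expected recurrent-parent genome fraction) are introduced here for illustration only.

```latex
\begin{align}
D_n &= \left(\tfrac{1}{2}\right)^{n+1}, & R_n &= 1 - D_n, \\
D_2 &= \tfrac{1}{8} = 12.5\%, & R_2 &= \tfrac{7}{8} = 87.5\%.
\end{align}
```

Selfing a BC2 plant to obtain the BC2F2 generation does not change these averages; it redistributes the donor segments between heterozygous and homozygous states, so individual lines vary around the expected values.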
Several GSR varieties that are drought-tolerant; salinity-tolerant; submergence-tolerant; weed-tolerant; suitable for rainfed lowlands; resistant to multiple threats (rice blast, rice planthoppers and gall midge); zero-input (no fertilisers and no pesticides); and high-yielding are now ready for seed exchange and germplasm distribution. GSR varieties that combine several of these traits are also in the pipeline, so that they can perform well in the toughest conditions and provide a stable and sustainable solution for the poorest farmers who cultivate rice. In recent times, the focus of GSR research has changed considerably as the additional challenge of balancing food security with the preservation of natural resources and protection of the environment has come to light. An achievable solution is the development of improved GSR varieties, together with the development and diffusion of conventional agricultural practices that are also environment-friendly.