Introduction

Arabidopsis thaliana has been adopted as a model organism in the study of plant biology because of its small size, short generation time, and high efficiency of transformation (Meinke et al. 1998). The whole genome sequence has been determined by the Arabidopsis Genome Initiative (AGI) (The Arabidopsis Genome Initiative 2000).

About 1,500,000 expressed sequence tags (ESTs) from Arabidopsis have been deposited in the EST database (dbEST) as of 5 January 2009, including sequences from large-scale EST projects in France (Höfte et al. 1993; Cooke et al. 1996), the United States (Newman et al. 1994; White et al. 2000), and Japan (Asamizu et al. 2000). These projects have produced EST data from different tissues, organs, and developmental stages (Höfte et al. 1993; Newman et al. 1994; Cooke et al. 1996; Asamizu et al. 2000; White et al. 2000). However, as of 1996, only about 50,000 Arabidopsis ESTs were registered (Höfte et al. 1993; Newman et al. 1994; Cooke et al. 1996) and most of these EST projects were based on cDNA libraries in which most of the inserts are not full-length. ESTs are useful for making a catalog of expressed genes, but not for further study of gene function. Consequently, genome-scale collections of the full-length cDNAs of expressed genes are important for the analysis of the structure and function of genes and their products in this era of functional genomics.

Since 1996, we have constructed Arabidopsis full-length cDNA libraries from plants grown under different conditions (Seki et al. 1998, 2002a) using the biotinylated CAP trapper method exploiting trehalose-thermoactivated reverse transcriptase (Carninci et al. 1996, 1997, 1998) and about 240,000 RIKEN Arabidopsis full-length (RAFL) clones have been isolated (Seki et al. 2002a; Sakurai et al. 2005). At present, there are numerous Arabidopsis full-length cDNAs produced and deposited in the GenBank database by other groups, such as Ceres (Haas et al. 2002), Genoscope (Castelli et al. 2004), and others. Information on these full-length cDNAs is available at http://www.arabidopsis.org/portals/masc/ORFeomics_2008Report.pdf. Full-length cDNAs have many advantages for improvement of genome annotation and functional genomics in the post-sequencing era (Fig. 1) (Seki et al. 2001a, 2002b, 2004a).

Fig. 1 Application of RIKEN Arabidopsis thaliana full-length (RAFL) cDNAs to plant functional genomics

In this review, we summarize the present state and perspectives of analyses using RAFL cDNAs, including their collection and annotation, their application to expression profiling, and the structural and functional analysis of plant proteins.

Collection and sequencing of RAFL cDNAs

As reported previously, we have constructed Arabidopsis full-length cDNA libraries from plants grown under various stress, hormone and light conditions, from plants at various developmental stages, and from various plant tissues (Seki et al. 1998, 2002a) using the biotinylated CAP trapper method with trehalose-thermoactivated reverse transcriptase (Carninci et al. 1996, 1997, 1998). The overall strategy for preparing cDNA libraries, including standard, normalized, and subtracted libraries, has been described (Seki et al. 2001b). We have isolated about 240,000 RAFL cDNA clones, clustered into about 17,000 non-redundant cDNA groups, representing about 60% of all Arabidopsis predicted genes (Fig. 1; Seki et al. 2002a; Sakurai et al. 2005). Note that all Arabidopsis full-length cDNAs including the RAFL cDNAs are mapped on about 19,000 loci in the Arabidopsis genome.

Using the 5′-end sequences of mRNAs, promoter sequences can be obtained by comparison with Arabidopsis genomic sequences. We obtained 5′-ESTs of the RAFL cDNA clones and constructed a promoter database (Seki et al. 2002a; Sakurai et al. 2005) using the plant cis-acting regulatory DNA elements (PLACE) database (Higo et al. 1999). The resulting Arabidopsis promoter database contains the genomic sequence 1,000 bp upstream of the 5′-terminus of each RAFL cDNA clone, together with the known plant cis-acting elements found in it, and is available as part of the RIKEN Arabidopsis Genome Encyclopedia (RARGE) database (http://rarge.gsc.riken.go.jp/; Sakurai et al. 2005). Several established plant promoter databases are also available today, such as the Arabidopsis Gene Regulatory Information Server (AGRIS, http://arabidopsis.med.ohio-state.edu; Davuluri et al. 2003). Yamamoto et al. (2007) have applied local distribution of short sequences (LDSS) analysis to extract promoter constituents by genome-wide statistical analysis, and have identified 1,000 octamer sequences as LDSS-positive promoter elements. The information on core promoters thus extracted is available at the plant promoter database (PPDB, http://www.ppdb.gene.nagoya-u.ac.jp).
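The idea of taking the genomic sequence upstream of a mapped cDNA 5′-terminus can be illustrated with a minimal Python sketch. It is not the RARGE implementation. The function names, the 0-based forward-strand coordinate convention, and the handling of chromosome ends are assumptions made only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def upstream_region(chrom_seq: str, tss: int, strand: str, length: int = 1000) -> str:
    """Return up to `length` bases upstream of a transcript 5'-terminus.

    Assumed convention (hypothetical, for illustration only):
    `tss` is the 0-based forward-strand index of the first transcribed base.
    The result is always written 5'->3' on the transcript's own strand and
    is shorter than `length` if the chromosome end is reached.
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    if strand == "-":
        # For a reverse-strand transcript, "upstream" lies at higher
        # forward-strand coordinates and must be reverse-complemented.
        end = min(len(chrom_seq), tss + 1 + length)
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")
```

With a 1,000 bp default, the returned string corresponds to the kind of upstream window described above, which can then be searched for known cis-acting elements.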

Although many algorithms have been written to predict a transcriptional unit (TU) from genomic sequence data, the accuracy of such predictions is still limited. A more direct and efficient approach to identify coding sequences is to sequence full-length cDNAs (Fig. 1). We have been determining the full-length sequences of the RAFL cDNA clones in collaboration with the Arabidopsis SSP group in the United States (Yamada et al. 2003), which comprises the Salk Institute [principal investigator (PI): J. R. Ecker], the Stanford Genome Technology Center (PI: R. W. Davis) and the Plant Gene Expression Center (PI: A. Theologis), and the Japanese group (K. Hanada et al., unpublished results), which comprises the RIKEN BioResource Center (BRC) (PI: M. Kobayashi), the National Institute of Genetics (PI: Y. Kohara) and the Genome Core Technology Facilities of RIKEN Genomic Sciences Center (GSC) (PI: Y. Sakaki). The RAFL cDNA clones are publicly available from the RIKEN BRC (http://www.brc.riken.go.jp/lab/epd/Eng/).

Application of full-length cDNAs to genome sequence annotation

Dramatic improvements in Arabidopsis genome sequence annotation have been achieved by mapping of RAFL cDNA sequences to the Arabidopsis genome (Seki et al. 2002a; Yamada et al. 2003).

Genome-wide analysis of alternative splicing events in Arabidopsis found that more than 4,700 pre-mRNAs were alternatively spliced (Iida et al. 2004; Wang and Brendel 2006). Iida et al. (2004) found that the pattern of alternative splicing events was affected by cold stress conditions. Recent full-length sequencing analysis of 1,800 RAFL cDNAs whose 5′- and/or 3′-end sequences had previously indicated alternative splicing events or alternative transcription start sites revealed the presence of 601 novel alternatively spliced/structure variant transcripts in Arabidopsis (Iida et al. 2009).

More than 1,000 overlapping sense-antisense transcript (SAT) pairs have been identified by a genome-wide search of Arabidopsis cDNAs (Seki et al. 2004b; Jen et al. 2005; Wang et al. 2005). Antisense RNAs are thought to negatively control the expression of sense transcripts in plants (Borsani et al. 2005). Recently, we identified about 8,000 SAT pairs via Arabidopsis tiling array analysis under abiotic stresses. Many non-protein-coding transcripts were found to belong to SAT pairs, and the expression ratios (treated/untreated) of sense transcripts and those of the corresponding antisense transcripts showed a significant linear correlation (Matsui et al. 2008). Antisense RNAs have been shown to participate in a broad range of regulatory processes, such as gene silencing, RNA stability, RNA editing, RNA masking, and methylation. Our recent tiling array analysis also demonstrated that several non-protein-coding antisense RNAs are suppressed by the nonsense-mediated mRNA decay (NMD) pathway (Kurihara et al. 2009).
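As an illustration of what a genome-wide search for overlapping sense-antisense pairs involves, the following Python sketch reports pairs of mapped transcripts that lie on opposite strands of the same chromosome and share at least one base. It is a simplified example and not the procedure used in the cited studies. The `Transcript` record, the half-open coordinates, and the use of whole transcript spans rather than exon-level overlap are all assumptions.

```python
from typing import List, NamedTuple, Tuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def find_sat_pairs(transcripts: List[Transcript]) -> List[Tuple[str, str, int]]:
    """Return (plus_name, minus_name, overlap_bp) for opposite-strand overlaps."""
    by_chrom = {}
    for t in transcripts:
        by_chrom.setdefault(t.chrom, []).append(t)

    pairs = []
    for items in by_chrom.values():
        plus = sorted((t for t in items if t.strand == "+"), key=lambda t: t.start)
        minus = sorted((t for t in items if t.strand == "-"), key=lambda t: t.start)
        for p in plus:
            for m in minus:
                if m.start >= p.end:
                    break  # minus transcripts are sorted; none further can overlap
                if m.end > p.start:
                    overlap = min(p.end, m.end) - max(p.start, m.start)
                    pairs.append((p.name, m.name, overlap))
    return pairs
```

A production analysis would normally compare exon coordinates and use an interval index for speed. The nested loop is kept here only for readability.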

One significant class of genes missing from the existing genome annotation is non-protein-coding RNAs. In addition to their role in protein synthesis (ribosomal and transfer RNAs), non-protein-coding RNAs have been implicated in control processes such as chromosomal silencing, transcriptional regulation, developmental control, and responses to stress (MacIntosh et al. 2001). Recently, we identified about 7,000 putative non-protein-coding RNAs in unannotated intergenic regions using an Arabidopsis Affymetrix tiling array (Matsui et al. 2008). These include non-protein-coding RNAs present 5′-upstream and 3′-downstream of AGI code genes. Interestingly, 27 promoter-associated short RNA (PASR)-like transcriptional units (TUs) (Kapranov et al. 2007) and 27 termini-associated short RNA (TASR)-like TUs (Kapranov et al. 2007), which are supported by full-length cDNAs, have been identified in the tiling array analysis (Fig. 2b, d; Matsui et al. 2008). Eight PASR-like TUs and ten TASR-like TUs that are supported by full-length cDNAs show ABA- or stress-responsive gene expression. Martianov et al. (2007) demonstrated that a non-protein-coding transcript upstream of the human dihydrofolate reductase (DHFR) gene has a critical function in transcriptional repression of the DHFR gene. Several novel PASR-like TUs in 5′-upstream regions might act as negative regulators of the downstream main TUs. Our tiling array analysis also showed that the 5′- and 3′-end regions of the 67 and 34 AGI code genes (Fig. 2a, c; Matsui et al. 2008), respectively, are shorter in a previous TAIR6 gene model than the gene model of the AGI code genes detected by the “ARTADE” (Arabidopsis tiling array-based detection of exons) program (Toyoda and Shinozaki 2005), as also supported by full-length cDNAs. These results show that the tiling array is also a useful tool for improvement of genome sequence annotation.

Fig. 2 New gene models supported by full-length cDNAs and tiling array analysis. a, c Arabidopsis Genome Initiative (AGI) code genes whose 5′-end (a) or 3′-end (c) regions are short in the TAIR6 gene model. b Promoter-associated short RNA (PASR)-like transcriptional units (TUs). The full-length cDNA sequences support the non-AGI TUs mapped on the promoter region of the AGI code genes. d Termini-associated short RNA (TASR)-like TUs. The full-length cDNA sequences support the non-AGI TUs mapped around the 3′-termini of the AGI code genes. The lower panels represent examples of the tiling array expression data supporting the new gene models (Matsui et al. 2008; http://omicspace.riken.jp/gps/group/psca1)

RAFL cDNA microarray analysis

cDNA microarrays are a powerful tool for the systematic analysis of expression profiles of large numbers of genes, including stress-inducible gene expression and changes in the expression profiles of mutants or transgenics (Seki et al. 2004a). One interesting type of application of microarray analysis is the identification of novel cis-elements that regulate the expression of genes in response to various experimental treatments (Simpson et al. 2003). By identifying subsets of the genes that have a common expression profile, it might be possible to identify conserved motifs in promoter regions. For example, promoter databases have been used for systematic analysis of cis-acting elements in Arabidopsis (Fig. 1).
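The search for conserved motifs in the promoters of co-expressed genes can be sketched in a very simple form: count, for every short sequence of length k, how many promoters in the group contain it, and keep those present in most of them. The Python example below is only a toy illustration of that idea. It is not LDSS analysis or any published motif-finding method. The function names and the `min_fraction` threshold are hypothetical, and a real analysis would also compare against a background set of promoters.

```python
from collections import Counter
from typing import Iterable, List


def kmer_promoter_counts(promoters: Iterable[str], k: int = 8) -> Counter:
    """Count, for each k-mer, the number of promoters that contain it at least once."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        # A set is used so that each promoter contributes at most 1 per k-mer.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


def shared_kmers(promoters: List[str], k: int = 8, min_fraction: float = 0.8) -> List[str]:
    """Return k-mers found in at least `min_fraction` of the given promoters."""
    counts = kmer_promoter_counts(promoters, k)
    threshold = min_fraction * len(promoters)
    return sorted(kmer for kmer, n in counts.items() if n >= threshold)
```

Only the forward strand is scanned in this sketch. Counting a k-mer together with its reverse complement would be a natural extension.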

We prepared the following two types of cDNA microarray: (1) a 1.3K RAFL cDNA microarray (Seki et al. 2001a) containing about 1,300 RAFL cDNA clones, and (2) a 7K RAFL cDNA microarray (Seki et al. 2002b) containing about 7,000 RAFL cDNA clones. Using these cDNA microarrays, we have studied the expression profiles of Arabidopsis genes under various stress conditions (Fig. 1), such as drought, cold, and high-salinity stresses (Seki et al. 2001a, 2002b), and high light stress (Kimura et al. 2003), as well as various treatment conditions, such as abscisic acid (ABA) (Seki et al. 2002c), rehydration treatment after dehydration (Oono et al. 2003), ethylene (Narusaka et al. 2003), jasmonic acid (JA) (Narusaka et al. 2003), salicylic acid (SA) (Narusaka et al. 2003), reactive oxygen species (ROS)-inducing compounds such as paraquat and rose bengal (Narusaka et al. 2003), UV-C (Narusaka et al. 2003), proline (Pro) (Satoh et al. 2002), and inoculation with pathogen (Narusaka et al. 2003). We have also studied expression profiles in various mutants and transgenic plants (Fig. 1; Seki et al. 2001a; Osakabe et al. 2002; Abe et al. 2003; Dubouzet et al. 2003; Nanjo et al. 2003; Chini et al. 2004; Kamei et al. 2005; Noutoshi et al. 2005; Osakabe et al. 2005). Note that various types of oligonucleotide DNA microarrays, available from Affymetrix (http://www.affymetrix.com/products_services/index.affx#1_1), Agilent Technologies (http://www.chem.agilent.com/en-US/products/instruments/dnamicroarrays/Pages/default.aspx), and other suppliers, have recently been widely used instead of cDNA microarrays. This may be because oligonucleotide DNA microarrays carry more genes per array and are easier to manage than cDNA microarrays. The oligonucleotide microarrays have been prepared using sequence information from the updated gene models of the Arabidopsis genome. Expression profiling studies using these microarrays have provided the expression levels of many genes as a detailed snapshot of the state of a plant under given conditions.

Identification of genes regulated by drought, cold, high-salinity-stress or abscisic acid

Plant growth is affected greatly by environmental abiotic stresses, such as drought, high salinity, and low temperature. Plants respond and adapt to these stresses in order to survive. These stresses induce various biochemical and physiological responses in plants. Several thousand genes have been identified that respond to drought, high-salinity or cold stress at the transcriptional level (Thomashow 1999; Hasegawa et al. 2000; Seki et al. 2002b; Zhu 2002; Matsui et al. 2008). It is important to study the function of stress-inducible genes not only to understand the molecular mechanisms of stress tolerance and responses in plants but also to improve stress tolerance by genetic engineering. Stress-inducible genes have been used to improve the stress tolerance of plants by gene transfer (Thomashow 1999; Hasegawa et al. 2000; Shinozaki and Yamaguchi-Shinozaki 2000).

Several years ago, we prepared a full-length cDNA microarray (7K RAFL cDNA microarray) containing ca. 7,000 independent Arabidopsis full-length cDNA groups (Seki et al. 2002b), and applied the 7K RAFL cDNA microarray to identify new drought-, cold-, high-salinity- or abscisic acid (ABA)-inducible genes. We identified 299 drought-inducible genes, 54 cold-inducible genes, 213 high-salinity-stress-inducible genes and 245 ABA-inducible genes (Seki et al. 2002b, c). Venn diagram analysis indicated the existence of significant crosstalk between drought and high-salinity stress signaling processes (Seki et al. 2002b). Many ABA-inducible genes are induced after drought- and high-salinity-stress treatments, which indicates the existence of significant crosstalk between drought and ABA responses (Seki et al. 2002c). These results indicate the presence of strong overlaps of gene expression in response to drought, high-salinity, and ABA (Shinozaki and Yamaguchi-Shinozaki 2000), and partial overlap of gene expression in response to cold and osmotic stress.
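The Venn diagram analysis mentioned above amounts to partitioning the genes by the exact combination of treatments that induce them. A minimal Python sketch is given below. The function name and the gene identifiers in the usage example are placeholders invented for illustration and do not correspond to the published gene lists.

```python
from typing import Dict, FrozenSet, Set


def venn_regions(groups: Dict[str, Set[str]]) -> Dict[FrozenSet[str], Set[str]]:
    """Partition genes by the exact set of conditions under which they are induced.

    `groups` maps a condition name (e.g. "drought") to the set of gene
    identifiers induced under that condition. The result maps each
    combination of conditions to the genes induced under exactly those
    conditions and no others (one Venn diagram region per key).
    """
    all_genes = set().union(*groups.values())
    regions: Dict[FrozenSet[str], Set[str]] = {}
    for gene in all_genes:
        key = frozenset(name for name, genes in groups.items() if gene in genes)
        regions.setdefault(key, set()).add(gene)
    return regions


# Usage with placeholder identifiers (hypothetical data, not real results):
example = {
    "drought": {"g1", "g2", "g3"},
    "high_salinity": {"g2", "g3", "g4"},
    "cold": {"g3", "g5"},
}
regions = venn_regions(example)
shared_drought_salt_only = regions[frozenset({"drought", "high_salinity"})]
```

The size of each region, relative to the sizes of the input sets, is what supports statements about strong or partial overlap between responses.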

The products of the drought-, high-salinity- or cold-stress-inducible genes can be classified into two groups (Fig. 3; Shinozaki and Yamaguchi-Shinozaki 2000; Seki et al. 2002b). The first group includes functional proteins, or proteins that probably function in stress tolerance. They are late-embryogenesis abundant (LEA) proteins, heat shock proteins, KIN (cold-inducible) proteins, osmoprotectant-biosynthesis-related proteins, carbohydrate-metabolism-related proteins, water channel proteins, sugar transporters, potassium transporters, detoxification enzymes, proteases, senescence-related proteins, protease inhibitors, ferritin, and lipid transfer proteins (Seki et al. 2002b).

Fig. 3 Drought-, cold- and high-salinity-stress-inducible genes and their possible functions in stress tolerance and response

The second group contains regulatory proteins, that is, protein factors involved in further regulation of signal transduction and gene expression that probably function in the response to stress (Shinozaki and Yamaguchi-Shinozaki 2000; Seki et al. 2002b, c). These include various transcription factors, protein kinases, protein phosphatases, enzymes involved in phospholipid metabolism, and other signaling molecules such as calmodulin-binding protein (Seki et al. 2002b, c). We identified many stress-inducible transcription factor (TF) genes, such as dehydration-responsive element (DRE)-binding protein (DREB), ethylene-responsive element binding factor (ERF), zinc finger, WRKY, MYB, basic helix-loop-helix (bHLH), bZIP, NAC and homeodomain-leucine zipper (HD-ZIP) TF genes, suggesting that various transcriptional regulatory mechanisms function in the drought-, cold- or high-salinity-stress signal transduction pathways (Seki et al. 2002b, c). These transcription factors probably regulate various stress-inducible genes cooperatively or separately.

Identification of candidate genes regulated by stress-inducible transcription factors

Transcriptional activation of some stress-responsive genes, such as the RD29A/COR78/LTI78 gene (responsive to dehydration/cold-regulated/low-temperature-induced) is well understood. The promoter of this gene contains both an ABRE (abscisic acid-responsive element) and a DRE/CRT (dehydration responsive element/C-repeat) (Yamaguchi-Shinozaki and Shinozaki 2005, 2006). ABRE and DRE/CRT are cis-acting elements that function in ABA-dependent and ABA-independent gene expression in response to stress, respectively. Transcription factors belonging to the ERF/AP2 (ethylene-responsive element binding factor/apetala 2) family that bind to DRE/CRT were isolated and termed DREB1/CBF (DRE-binding protein 1/C-repeat-binding factor) and DREB2 (Yamaguchi-Shinozaki and Shinozaki 2005, 2006). Their conserved DNA-binding motif is A/GCCGAC. The DREB1/CBF genes are rapidly and transiently induced in response to cold stress, and these transcription factors in turn activate the expression of target genes.

Overexpression of the DREB1A/CBF3 cDNA under the control of the cauliflower mosaic virus (CaMV) 35S promoter or the stress-inducible rd29A promoter in transgenic plants gave rise to strong constitutive expression of stress-inducible DREB1A target genes, and increased tolerance to freezing and drought stresses (Jaglo-Ottosen et al. 1998; Liu et al. 1998; Kasuga et al. 1999). Kasuga et al. (1999) identified six DREB1A target genes. However, it remains poorly understood how overexpression of the DREB1A cDNA in transgenic plants increases tolerance to freezing and drought stresses. We applied the RAFL cDNA microarrays to identify new target genes of DREB1A (Seki et al. 2001a; Maruyama et al. 2004) and identified more than 40 DREB1A target genes. The downstream target genes include C2H2 zinc-finger-type and ERF/AP2-type TFs, RNA-binding proteins, sugar transport proteins, LEA proteins, KIN proteins, RFO (raffinose family oligosaccharides)-biosynthesis-related proteins, and protease inhibitors. We searched the promoter regions of the DREB1A/CBF3 target genes for conserved sequences and found A/GCCGACNT between −51 and −450 as a consensus DRE (Maruyama et al. 2004). These results showed that the DNA microarray is a useful system with which to identify target genes of stress-related transcription factors and potential cis-acting DNA elements by combining expression data with genomic sequence data.
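Combining expression data with genomic sequence data in this way relies on scanning candidate promoters for the DRE core motif. The Python sketch below searches both strands of an upstream sequence for A/GCCGAC and reports hits whose position falls in a window from −450 to −51. It is an illustrative example only and not the procedure of Maruyama et al. (2004). The function name, the assumption that the last base of the input corresponds to position −1, and the choice to report the leftmost base of each hit are all assumptions.

```python
import re
from typing import List, Tuple

DRE_CORE = re.compile(r"[AG]CCGAC")     # A/GCCGAC read on the given strand
DRE_CORE_RC = re.compile(r"GTCGG[CT]")  # the same motif on the opposite strand


def find_dre(promoter: str, window: Tuple[int, int] = (-450, -51)) -> List[Tuple[int, str, str]]:
    """Return sorted (position, strand, matched_sequence) hits inside `window`.

    `promoter` is the upstream sequence written 5'->3', with its last base
    assumed to be position -1 relative to the transcription start site.
    `position` is that of the leftmost base of the match.
    """
    promoter = promoter.upper()
    n = len(promoter)
    hits = []
    for strand, pattern in (("+", DRE_CORE), ("-", DRE_CORE_RC)):
        for m in pattern.finditer(promoter):
            pos = m.start() - n  # -n for the first base, -1 for the last
            if window[0] <= pos <= window[1]:
                hits.append((pos, strand, m.group()))
    return sorted(hits)
```

Genes whose promoters yield at least one hit, and whose expression changes in the overexpressing plants, would be the natural candidates for direct targets in such an analysis.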

We have also applied the RAFL cDNA microarray to identify the target genes of the following stress-related transcription factors: ERF/AP2 TF family, e.g., DREB2A (Sakuma et al. 2006); bZIP TF family, e.g., AREB1 (Fujita et al. 2005); MYB TF family, e.g., AtMYB2 (Abe et al. 2003); bHLH TF family, e.g., AtMYC2 (Abe et al. 2003); NAC TF family, e.g., RD26/ANAC072 (Fujita et al. 2004; Tran et al. 2004), ANAC019 (Tran et al. 2004) and ANAC055 (Tran et al. 2004). The roles of TFs in the abiotic stress signaling and the expression profiling results are summarized in recent reviews (Bartels and Sunkar 2005; Seki et al. 2005; Yamaguchi-Shinozaki and Shinozaki 2005, 2006). Information on the target genes is useful for understanding the transcriptional regulatory networks involved in cellular responses to abiotic stresses.

Transcriptome analysis in the recovery process following stress

Analysis of genes involved in the recovery from stress, as well as stress-inducible genes, is also important, not only for the understanding of the molecular responses to abiotic stresses but also for improving the stress tolerance of crops by gene manipulation. Oono et al. (2003) used the 7K RAFL cDNA microarray to identify genes that are induced during the rehydration process after dehydration stress treatment, and identified 152 rehydration-inducible genes. These genes can be classified into the following three major groups: (1) regulatory proteins involved in further regulation of signal transduction and gene expression, (2) functional proteins involved in the recovery process after dehydration-induced damage, and (3) functional proteins involved in plant growth (Oono et al. 2003). Venn diagram analysis also showed that among the rehydration-inducible genes, at least two gene groups existed, i.e., genes functioning in adjustment of cellular osmotic conditions and those functioning in the repair of drought-stress-induced damage, and that most of the rehydration-downregulated genes are dehydration-inducible (Oono et al. 2003).

Oono et al. (2006) analyzed the gene expression profiles in the process of cold acclimation and deacclimation (recovery from cold stress) using two microarray systems: the 7K RAFL cDNA microarray and the Agilent 22K oligonucleotide array. Both microarray analyses identified 292 genes up-regulated and 320 genes down-regulated during deacclimation, and 445 cold up-regulated genes and 341 cold down-regulated genes during cold acclimation. Many genes up-regulated during deacclimation were found to be down-regulated during cold acclimation, and vice versa.

Application of RAFL cDNAs to functional analysis of proteins

Endo’s group at Ehime University has established an efficient wheat germ cell-free protein synthesis system to produce milligram quantities of proteins (Madin et al. 2000; Sawasaki et al. 2002). We have applied the wheat germ cell-free protein synthesis system using the RAFL cDNAs to study the functional characteristics of Arabidopsis proteins (Fig. 1). Sawasaki et al. (2004) used about 400 RAFL cDNAs encoding protein kinases for protein synthesis using the wheat germ system. The assay revealed about 200 products with autophosphorylation activity. Seven proteins out of 26 calcium-dependent protein kinases phosphorylated a synthetic peptide substrate in the presence of calcium ions, demonstrating that the translation products retained their substrate specificity. Recently, we also demonstrated that the wheat germ system is useful for studying the biochemical characteristics of TFs (T. Sawasaki et al. unpublished results) and proteins involved in ubiquitination (Takahashi et al. 2009).

We also applied the RAFL cDNAs to a gain-of-function gene hunting technique, the full-length cDNA over-expressing (FOX) gene hunting system (Ichikawa et al. 2006; Fig. 1), which involves the random overexpression of a normalized full-length cDNA library. Ichikawa et al. (2006) introduced about 10,000 independent RAFL cDNAs under the control of the CaMV 35S promoter into Arabidopsis, and found about 1,500 possible morphological mutants with various phenotypes, e.g., pale green, dwarf, and bushy phenotypes, from about 15,000 transformants. Fujita et al. (2007) focused on stress-inducible TFs; the full-length cDNAs of 43 stress-inducible TFs were mixed to create the FOX lines. After screening for salt-stress-resistant lines, a number of salt-tolerant lines were found to harbor a bZIP-type transcription factor (AtbZIP60) involved in the endoplasmic reticulum stress response. Full-length rice cDNAs have been introduced into Arabidopsis (Kondou et al. 2009) and rice (Nakamura et al. 2007) plants using the FOX system, and many FOX lines showing altered growth or morphological characteristics, such as super-dwarf mutants, have been obtained. These results demonstrate that the FOX system is a useful method to screen for genes with valuable functions. Ogawa et al. (2008) introduced 96 metabolism-related RAFL cDNAs into Arabidopsis suspension-cultured T87 cells by Agrobacterium-mediated transformation to study the plant metabolome.

ORFeome clones, that is, cDNA clones containing full-length open reading frames (ORFs) are a valuable research tool for functional proteomics. In collaboration with the Arabidopsis SSP group in the United States (Yamada et al. 2003), we used the RAFL cDNAs to construct the U (pUNI) clone, an Arabidopsis ORFeome clone. We have constructed about 10,500 U clones and determined full-length sequences of the intact ORF regions for confirmation of error-free ORFeome clones. The U clones are publicly available from the Arabidopsis Biological Resource Center (ABRC; http://www.biosci.ohio-state.edu/~plantbio/Facilities/abrc/abrchome.htm). Several groups, such as the Yale group (Gong et al. 2004), also have constructed various Arabidopsis ORFeome clones. Information on ORFeome clones is available at http://www.arabidopsis.org/portals/masc/ORFeomics_2008Report.pdf.

Application of RAFL cDNAs to structural analysis of proteins

Full-length cDNAs are useful resources for determining the three-dimensional structures of proteins by X-ray crystallography and NMR spectroscopy (Seki et al. 2001b) (Fig. 1). We have determined the three-dimensional structures of plant proteins using Arabidopsis full-length cDNAs by NMR spectroscopy in the RIKEN Structural Genomics Initiative (Yokoyama et al. 2000), using cell-free protein synthesis systems for protein expression. Cell-free in vitro systems have three advantages over conventional in vivo expression systems: (1) cell-free systems are suitable for automated, high-throughput expression, as proteins can be produced without the need for cloning genes into expression vectors; (2) milligram quantities of proteins can be obtained in several hours; and (3) proteins that are difficult to express in vivo can be produced in vitro.

We have applied this system to plant protein expression and determined the domain structure of 29 proteins containing plant-specific-type TFs, such as the DNA-binding domain of squamosa promoter-binding protein (Yamasaki et al. 2004b, 2006), the B3 DNA-binding domain of the cold-responsive transcription factor RAV1 (RAV for related to ABI3/VP1) (Yamasaki et al. 2004a), the C-terminal WRKY domain of the WRKY4 protein (Yamasaki et al. 2005a), and the DNA-binding domain of an ethylene-insensitive 3 (EIN3) protein, EIL3 (Yamasaki et al. 2005b). Determination of the three-dimensional structure of the DNA-binding domains of stress-inducible TFs might be applied to alter the target genes for improvement of stress tolerance.

Conclusions and future perspectives

We have demonstrated that full-length cDNAs are an important resource for improvement of genome sequence annotation, expression profiling studies, and functional and structural analysis of plant proteins in the post-sequencing era. The RAFL cDNA clones are publicly available from the RIKEN BRC, and the BRC has distributed about 24,000 RAFL cDNAs to more than 500 labs as of 22 October 2008. The RAFL cDNAs are a standard resource in the Arabidopsis research community.

Once all the Arabidopsis full-length cDNAs are functionally annotated, the database will be the Rosetta Stone for understanding the network of gene functions in higher plants (Appels et al. 2003; Clarke et al. 2003). Information gained from these full-length cDNAs can be applied to other crops, such as rice (Kikuchi et al. 2003), wheat (Ogihara et al. 2004), barley (Sato et al. 2009), soybean (Umezawa et al. 2008), and cassava (Sakurai et al. 2007), to trees, such as poplar (Nanjo et al. 2007), Cryptomeria japonica (Futamura et al. 2008), and Sitka spruce (Ralph et al. 2008), and to model plants, such as Physcomitrella patens (Nishiyama et al. 2003) and Thellungiella halophila (Taji et al. 2008). Full-length cDNAs will be used for improvement of plants in the future.

Recently, the SABRE (systematic consolidation of Arabidopsis and other botanical resource) database (http://saber.epd.brc.riken.jp/sabre/SABRE0101.cgi) has been developed to provide organized information on plant full-length cDNA resources that are available from RIKEN BRC. The SABRE database will help researchers access counterpart full-length cDNA resources in other plant species for basic and applied science.

Complete genome sequences of various plant species, such as rice (International Rice Genome Sequencing Project 2005), poplar (Tuskan et al. 2006), grapevine (The French-Italian Public Consortium for Grapevine Genome Characterization 2007) and Physcomitrella (Rensing et al. 2008) have been determined. Recently, next generation sequencers that are available from 454 Life Sciences (http://www.454.com/; Margulies et al. 2005), Illumina (http://www.illumina.com/) and Applied Biosystems (http://www.appliedbiosystems.com/) have been applied to whole-genome sequencing in various plant species, and to the identification of whole-genome sequence variation in 1,001 natural strains of Arabidopsis (http://1001genomes.org; Ossowski et al. 2008). Paired-end diTag (PET; Ruan et al. 2007) analysis, a useful method to characterize both ends of DNA fragments, using such next generation sequencers, will be applied to the analysis of the full-length cDNAs of many plant species and strains in the future, and will contribute to the discovery of useful genes and our understanding of natural variation and evolution in plants.