Introduction

With the completion of a high quality sequence of the rice genome (cv Nipponbare) (Matsumoto et al. 2005), this crop species has become the model for deciphering the molecular control of agronomical traits in cereals, motivating the concurrent development of large libraries of mutant lines using chemical, physical or insertion mutagenesis (reviewed in (Hirochika et al. 2004). Like in Arabidopsis, generation of insertion libraries in rice has largely relied on the use of the T-DNA (reviewed in (Guiderdoni et al. 2007) and of class II transposable elements, such as the maize Ac/Ds and En/Spm transposon systems (recently updated in (Zhu et al. 2007). Another powerful mutagen in rice is the class I ty1-copia retrotransposon Tos17, which naturally resides in rice cultivars in a low number of copies and the transpositional activity of which has been evidenced in the mid-90s (Hirochika et al. 1996). Tos17 activity is specifically induced by the stress resulting from tissue culture and, as the retroelement transposes through a copy and paste mechanism, its copy number gradually increases with the tissue culture duration (Hirochika 2001). This property has been profitably used for generating a library of 55,000 insertion lines, regenerated from cell aggregates in suspension at the National Institute for Agrobiological Sciences (NIAS) (Tsukuba, Japan). These lines have been evaluated in the field (Miyao et al. 2007) while more than 18,000 Tos17 insertion sites have been characterized (Miyao et al. 2003) (data available at http://tos.nias.affrc.go.jp/). Due to the cell culture procedure, each line harbours an average of 10 new Tos17 copies. Based on the frequency of tagging by Tos17 of alterations observed in field evaluated insertion lines, it was estimated that the retroelement causes 5–10% of the observed somaclonal variation (Hirochika 2001), a phenomenon extensively reported in rice plants deriving from callus and cell cultures.

As the Agrobacterium-mediated transformation procedure of rice involves tissue culture steps, T-DNA insertion lines may contain additional copies of the Tos17 element. This actually depends on the presence of active copies of the retroelement in the cultivar used to generate the T-DNA insertion library. In rice, Nipponbare, which was used to produce a reference high-quality sequence and to generate insertion libraries, is known to harbour two copies of Tos17, one of which is active and located on chromosome 7. On the other hand, the 3 native copies of Tos17 residing in another japonica cultivar, Tainung 67 exhibit a very low activity with an average of 0.2 new copy per T-DNA plant (Hsing et al. 2007). Though the presence of an active element in the T-DNA plants may first appear a drawback in term of generation of unwanted mutations, this may turns in an advantage once the insertion sites of the element have been characterized. Tos17 may for instance create alleles in genes already tagged by a T-DNA insert or tag new genes in the same genetic background. Tos17 and T-DNA alleles in the POSTECH insertion line library (Jeong et al. 2006) have for instance been used to ascertain the function of genes such as FLORAL ORGAN NUMBER 1 (Moon et al. 2006) and UNDEVELOPED TAPETUM 1 (Jung et al. 2005). Moreover, the preference of insertion of the two mutagens appear complementary: while the two mutagens are both more frequently found in gene-rich regions, T-DNA inserts are more frequent in promoter and 3′UTRs regions while Tos17 inserts in coding sequences (An et al. 2003; Hsing et al. 2007; Jeong et al. 2006; Miyao et al. 2003; Sallaud et al. 2004; Zhang et al. 2006).

In the frame of the French genomics initiative Génoplante, we have generated a library of 30,000 T-DNA insertion lines in cv. Nipponbare, which has been characterized for T-DNA integration sites (Sallaud et al. 2004), and is being seed increased and evaluated under agronomic conditions at the Centro Internacional de Agronomia Tropical (CIAT) in Colombia (Lorieux et al unpublished). The resulting phenotypic information -now collected from about 14,000 insertion lines and gathered in the Oryza Tag Line (OTL) database (http://urgi.versailles.inra.fr/OryzaTagLine and Larmande et al., submitted)- has evidenced that 22% of the lines exhibit alteration in at least one trait. Aside on going establishment of tagging by DNA blot analyses using T-DNA- and Tos17- specific probes in mutant lines altered for specific traits, large scale sequencing of the Tos17 insertion sites in the library may provide new clues on causative links between the predicted functions of the disrupted gene and phenotype alterations observed in the field.

To complement the characterization of the OTL library, we first estimated the range and average number as well as inheritance of new copies of Tos17 in the regenerated T-DNA plants using DNA blot analysis and a reverse transcriptase probe. In a first attempt to characterize insertion sites, we developed a protocol, called TOSTRAP, based on adapter-ligation PCR (Siebert et al. 1995) for both readily and specifically amplifying genomic regions flanking insertion sites of new Tos17 copies. Applying this protocol to a large part of the library, we generated 19,252 PCR products and 14,513 good sequences, which, once anchored on the rice pseudomolecules, corresponded to 11,689 unique sites. We then investigated the preference of insertion of Tos17 and compared it to that of T-DNA in the same materials and that of Tos17 in the NIAS Nipponbare lines. We also assessed the number and the functional classification of genes tagged both by T-DNA and Tos17 inserts in the library.

Materials and methods

Plant materials

The Génoplante insertion line library was generated by Agrobacterium-mediated transformation of mature seed embryo-derived callus of rice (Oryza sativa L.), japonica cultivar Nipponbare following the procedure detailed in (Sallaud et al. 2003). Seeds were kindly provided by M. Yano (RGP, NIAS Tsukuba, Japan). We hereafter briefly summarize the time course of the procedure and the resulting cell lineage of T-DNA plants (Fig. 1): Three weeks following inoculation of mature seeds on a 2,4-D containing induction medium, several (5 on average) globular somatic embryos are generally isolated from the scutellum-derived primary callus, and allowed to grow on fresh medium for 10–14 days, forming Embryogenic Nodular Units (ENUs) (Bec et al. 1998). ENUs are immersed in an EHA105 or LBA4404 Agrobacterium tumefaciens suspension, then blot dried and transferred to a solid co-cultivation medium for 3 days. Following 5 weeks of selection, several hygromycin-resistant cell lines may arise from a single co-cultivated ENU and further regenerate T-DNA plants (4 plants on average in cultivar Nipponbare) which are generally independent transformation events. T-DNA plants may therefore share a same ENU and/or primary callus origin.

Fig. 1
figure 1

Scheme summarising the tissue culture procedure used in the rice transformation protocol of the OTL T-DNA collection and illustrating the possibility for newly transposed Tos17 inserts (shown as stars) to sequentially accumulate in the cell lineages giving rise to independent transformation events and also explaining the observed partial redundancy in the sequences obtained of Tos 17 inserts

About 25,286 T-DNA plants of the library were used in this study. DNA was isolated from leaves of regenerated T-DNA plants (T0) at their transfer to test tubes (for PCR analysis) (Sallaud et al. 2004) and, for 384 lines randomly selected in the pilot study, from the antepenultimate leaf of the several tillers of adult T0 plants (for DNA blot analysis). To ascertain transmission of Tos17 inserts in progenies, leaf samples of 20 T1 progeny seedlings were collected and pooled, in 65 randomly selected insertion lines belonging to the 384 subset. The segregation of the inserts was further investigated by collecting leaves on 8–16 individual T1 progeny plants in 15 other T1 lines exhibiting a high Tos17 copy number.

DNA blot analysis

To investigate the integration and transmission of Tos17 copies in T-DNA lines, total genomic DNA was extracted from leaf tissue using the MATAB method. Five μg of DNA were digested with XbaI endonuclease which cuts once within the Tos17 sequence of both resident copies and DNA fragments were separated on 0.8% agarose gels in TBE buffer, and transferred to nylon membranes (Hybond N+ Amersham) through alkaline transfer. A 0.3 kbp PCR fragment of the Tos17 reverse transcriptase (5′ GCTTTTCTTCATGGTGA 3′ and 5′ CACGGGAGGAAGTATGA 3′) served as template for synthesizing [α-32P] labelled probes through random priming. Following hybridization, the membrane was washed twice (SSC 2×, SDS 1% and SSC 0.1×, SDS0.1% both at 65°C for 10 min) and analyzed by autoradiography.

Amplification of flanking regions of Tos17 inserts by adapter-ligation PCR

Amplification of regions flanking the 5′ and the 3′ LTR of Tos17 inserts was performed following an adapter ligation-PCR protocol (Siebert et al. 1995), adapted to plant DNA (Balzergue et al. 2001; Devic et al. 1997), and modified for Tos17. Briefly, in a 10μl reaction volume, 62,5 ng of genomic DNA was digested with 1U EcoRV (or SspI) blunt end restriction enzyme and ligated with 0.24 U of T4 ligase to 375 μM of asymmetric adapters (Table 1) in a single step at 25°C overnight. To determine the number of amplification products, two successive amplifications with nested primers specific to the 3′ (or 5′) Long Terminal Repeat (LTR) of Tos17 and to the adapters were performed according to conditions detailed in Table 1. PCR2 products generated from the 3′ LTR in EcoRV digests were sequenced by Genome Express company (Grenoble, France) and the National Sequencing Centre Génoscope (Evry, France) and were uploaded to the EMBL nucleotide sequence database under accession numbers ER893126 to ER893861 and CU313326 to CU326334 respectively.

Table 1 Conditions employed to selectively amplify the newly-transposed Tos17 elements in the OTL T-DNA rice collection

Analysis of Tos17 sequences

Data resources

The rice genome and its annotation were downloaded from the TIGR Rice Genome Database (Version 4.0, January 2006). Genomic DNA sequences corresponding to entire chromosomes were formatted into FASTA-format nucleotide database files for BLAST (Basic Local Alignment Search Tool). The genome annotations were also used for the insertion mapping and as input files integrated into the OryGenesDB database http://OryGenesDB.cirad.fr (Droc et al. 2006).

Tos17 FSTs (Miyao et al. 2003) and FL-cDNA sequences (The Full Length cDNA consortium, 2003) were obtained from GenBank at the National Center for Biotechnology Information (NCBI) with query “txid4530 [orgn] AND GSS [PROP] AND Tos17 AND NIAS” and http://cdna.01.dna.affrc.go.jp/cDNA/ respectively.

FST Mapping and annotation

Each FST was mapped to the rice genome by BLAST, and annotated with the TIGR gene code when the insertion qualifies as a ‘gene hit’. If a flanking sequence was mapped to multiple genomic loci, only the best mapping was retained. We define ‘gene hit’ as the insertion site being located between 1000 bp upstream of the ATG and 300 bp downstream of stop codon of a gene. Based on the genomic coordinate, Flanking sequences were determined to be identical if they had less than a 3 bp mismatch at the insertion site.

GC content

We evaluated the GC content of 1 kb regions extending 5′ and 3′ from the mutagen insertion point corresponding to different sets of FSTs (see results section). We then determined the distribution of the GC content for each population. (Fig. 5).

Gene Ontology annotation

Using the latest release of the Gene Ontology Slim versions (ftp://ftp.geneontology.org/pub/go/GO_slims/goslim_plant.obo), which give a high level view of the three GO ontologies, we classified genes which are annotated to the term itself, or to any of the children of the GO slim term, are included in the corresponding GO slim category.

Expression of Tos17 target genes in callus tissue

To determine the frequency of expression of Tos17 target genes in callus tissues, we based our search on the number of occurrences of Massive Parallel Signature Sequencing (MPSS) cDNA-derived tags in the NCA (callus) library at http://mpss.udel.edu/rice/, through bulk query using subsets of 150 gene accessions.

Software

The data are stored in a MySQL database management system in order to provide fast and flexible queries as well as to facilitate the analysis. The back end sequences analysis codes were written in the Perl programming language. R (R development Core Team, 2005) and Bioconductor packages (i.e. limma for Venn’s diagram, Fig. 6) were used for statistical computing and graphics.

Availability of plant materials

Seeds and related phenotype information of the insertion lines used to generate the FSTs in this study are publicly available. The T2 seeds are distributed once seed increase of the insertion lines is achieved. A Material Transfer Agreement and an order form are downloadable at http://urgi.versailles.inra.fr/OryzaTagLine. Lines can be identified from their sequence tags at http://OryGenesDB.cirad.fr, which now provides a direct link to the Oryza Tag Line database through the line ID.

Results

Southern analysis of a subset of 384 lines for number of newly transposed Tos17 copies

DNA blots of the 384 T-DNA plants and untransformed Nipponbare hybridized to the Tos17 specific reverse transcriptase probe shared a strong hybridization signal at 4 kbp which actually consists of two thinner bands of similar molecular weight (3.9 and 4.1 kbp) (Fig. 2A). These bands correspond to the two Tos17 native copies residing on chromosomes 10 and 7 respectively in cv. Nipponbare. In addition, the T-DNA plants exhibited 0–8 additional bands corresponding to putative newly-transposed Tos17 copies, the transposition of which was induced by the tissue culture steps of the rice transformation procedure. Distribution of these additional bands in the 384 subset is shown in Fig. 2B. A total of 1,295 putative new copies were found in the 384 subset of analysed lines, representing an average of 3.37 copies per line. Stability and transmission of these copies in the T1 progeny of primary transformants was ascertained by parallel DNA blot analysis of T0 plants and pooled samples of T1 plants in 65 lines (Supplementary Fig. 1A). While all the bands observed in T0 were found in T1 samples, no additional band, which would have resulted from new transposition events, was observed. Furthermore, T1 progeny plants of 15 T0 transformants harbouring a high number of Tos17 newly-transposed copies were individually analyzed by DNA blot and demonstrated that the copies observed in T0 segregated, as expected, independently in the T1 progeny (Supplementary Fig. 1B). From this analysis, we extrapolated that the 25,286 T-DNA lines regenerated in this study overall harbour 83,686 newly-transposed Tos17 copies.

Fig. 2
figure 2

(A) DNA blot analysis of Tos17 integration patterns in a set of 72 T-DNA plants (XbaI digests). The arrow points to the two hybridizing fragments of approx. 4 kbp corresponding to the hybridization signals of the two native (resident) copies of Tos17 in the genome of the Oryza sativa cv. Nipponbare. Nip.: Nipponbare control. (B) Distribution of frequency of additional hybridization signals corresponding to newly-inserted Tos17 copies, among the 384 analysed T-DNA plants

Amplification of Tos17 flanking regions through adapter ligation PCR in the 384 subset

To determine the fraction of these putative newly-transposed Tos17 copies from which a flanking genomic fragment can be recovered using adapter-ligation PCR, we used Tos17- and adapter- specific primers and genomic DNA digested by the blunt end restriction enzymes Ssp1 and EcoRV. Amplification with primers specific to the 5′ LTR of Tos17 was first investigated. As shown in Fig. 3A,B, all of the PCR2 products in Ssp1 and EcoRV digests, except the negative controls, exhibited a common fragment at 580 bp and 818 bp, respectively and indicative of the amplification of the region flanking the chromosome 7 Tos17 “native” copy residing in the Nipponbare cultivar (Fig. 3A). In Ssp1 digests, 80.2% of the samples exhibited additional PCR fragments (51.5% a single fragment and 48.5% multiple (2–5) additional fragments). A total of 523 PCR fragments, putative flanking regions of new Tos17 inserts, were amplified from the 384 subset, representing 40.4% of the 1,295 additional signals identified by DNA blot analysis.

Fig. 3
figure 3

(A) Positions of EcoRV and SspI restriction sites on 2 kbp regions extending from the 5′ and 3′ LTRs of the Tos17 elements located on chromosome 7 and 10 in the genome of the Oryza sativa variety Nipponbare (B) Results obtained using the walk PCR mediated amplification of regions flanking the 5′ LTR of Tos17 in Ssp1 and EcoRV digests of genomic DNA of T-DNA plants. Positions of the fragments corresponding to the flanking region of the chromosome 7 native copy are pointed by arrows. (C) Results obtained using the walk PCR mediated selective amplification of regions flanking the 3′ LTR of newly-transposed Tos17 copies in EcoRV digests of genomic DNA of T-DNA plants

In EcoRV digests, a lower frequency of 64.3% of the samples exhibited additional PCR fragments compared to the control Nipponbare (64.3% a single fragment and 35.7% multiple (2–4) fragments), likely reflecting the lower density of EcoRV sites in the rice genome. A total of 321 PCR fragments, representing 24.8% of the DNA blot signals were amplified from the 384 subset. Thirty seven samples with no additional fragments in Ssp1 digests had additional fragments in EcoRV digests while 121 samples with no additional fragment in EcoRV digests had additional fragments in Ssp1 digests. Manual picking and individual amplification in a third PCR of the additional bands observed in PCR2, followed by sequencing of the products proved to be possible (Supplementary Fig. 2), but with a fluctuating efficiency (60% on average) and found to be very time- and labour- consuming. We concluded that an automation of the process was needed before pursuing the recovery of flanking regions from the 5′LTR of Tos17 inserts in the whole T-DNA rice collection, although the recovery rate of PCR2 fragments from this border appeared satisfactory.

Contrastingly, amplification of flanking regions from the 3′ LTR of Tos17 in DNA samples digested with EcoRV, did not conduct to the amplification of PCR fragments from the Nipponbare control due to the absence of EcoRV sites in the immediate vicinity (2 Kbp) of the native copies residing on chromosomes 7 and 10 (Fig. 3A). Out of the 384 analysed T0 T-DNA plants 265 (69%) of them (Fig. 3C) exhibited PCR2 fragments (of which 50.2% with a single fragment and 49.8% with multiple (2–4) fragments). A total of 415 PCR fragments representing 32% of the DNA blot signals were amplified from the 384 subset of rice T-DNA lines. These 265 PCR products were readily sequenced and yielded 237 (89.4%) good quality sequences which were anchored on the rice pseudomolecules and represented a success rate of 0.62 compared to the original 384 lines under examination. From these results we extrapolated that the analysis of the whole collection of 25,286 T-DNA plants should yield 15,677 sequences, representing the 18.7 % of the estimated 83,686 newly-transposed Tos17 copies in the library.

Sequence and analysis of Tos17 flanking regions

Amplification of flanking regions from the 3′LTR of Tos17 inserts in the EcoRV digests of the 25,286 DNA samples yielded 19,252 PCR products (76.1%) of which 55.9% with single and 44.1% with multiple fragments, respectively.

Direct sequencing of single fragment conducted to a higher frequency of good quality sequences than direct sequencing of multiple PCR fragments. On the other hand, when the sequence was long enough, we were able to retrieve two distinct sequences anchoring in different points of the rice genome sequence in 4% of the multiple fragment PCR products. This indicates that the dominant sequence signal is that of the shorter sequence corresponding to the low molecular weight PCR fragment. Overall 98.6% of the obtained sequences were anchored to the rice pseudomolecules totalling 14,513 sequences positioned in the rice genome which represent a slightly lower number than anticipated from the 384 subset data.

FST redundancy

We found that 2,824 of the 14,513 sequences were redundant with at least one other sequence, with a FST start pointing exactly to a same interval of ±3 nucleotides on the rice genome sequence. Even though Tos17 insertion is prone to hot spots (see below), it is rather unlikely that two Tos17 integrations randomly occurred precisely at the same nucleotide position. Redundancy of sequences may therefore result from either cross contamination of DNA samples and PCR products or from sequencing the flanking regions of Tos17 inserts in plants with a shared cell lineage. As described in the Materials and Methods section, the co-cultivation procedure uses embryogenic nodular units (ENUs), which may derive from the same primary seed embryo scutellum callus, and later yield an average of 4 regenerating, hygromycin-resistant cell lines (Fig. 1). As the ENU lineage information was available in our materials, we examined whether the redundancy of sequences in several T-DNA plants can be explained by a shared cell lineage. Manual curation of the redundant sequences indicated that in most cases (88.2%), redundant Tos17 FSTs were shared between different T-DNA plants regenerated from the same ENU. The rest (11.8%) could be explained by either PCR contamination or ENUs sharing cell lineage in originating from the same primary callus. The former result indicates that Tos17 transposition occurred early during ENU formation and resulted in a Tos17 insert present in most cells of the co-cultivated ENU. These redundant sequences were further eliminated in the subsequent set of analyses which were therefore carried out with 11,689 unique Tos17 newly-transposed insertion points.

Distribution of Tos17 inserts among and along rice chromosomes

The preference of insertion of Tos17 (Miyao et al. 2003) as well as that of the T-DNA (An et al. 2003; Hsing et al. 2007; Jeong et al. 2006; Sallaud et al. 2004) were extensively documented in previous reports. On the other hand, parallel characterization of Tos17 and T-DNA insertional behaviours in the same regenerated plant material had yet to be examined. Comparative analysis revealed a similar distribution and chromosomal insertion preference for Tos17 inserts in the T-DNA OTL and Tos17 NIAS collections (Table 2). In both collections chromosomes 2, 1, 3 and 6 exhibited the highest insertion densities while chromosomes 10 and 11 showed lower insertion densities of Tos17 (Table 3). T-DNA plants with characterized Tos17 insertion sites were found to harbour 9,113 anchored T-DNA insertion sites (Table 2). These T-DNA inserts also exhibited the highest density on chromosomes 1, 2 and 3. The largest chromosomes 1, 2 and 3 were also found to bear the highest densities of non-TE predicted genes and of expressed genes (Table 3). The distribution along the rice chromosomes of the 11,689 Tos17 inserts and of the 9,113 unique T-DNA inserts characterized in lines also containing a characterized Tos17 insert is illustrated in Fig. 4 (for Chromosome 1) and Supplementary Fig. 3 (for all 12 chromosomes). Density graphs of Tos17 inserts of the OTL library paralleled those of Tos17 inserts characterized at NIAS with a higher frequency in subtelomeric, euchromatic regions and a lower frequency in pericentromeric, heterochromatic regions. Hot spots for integration were also generally lying in the same sequence intervals. On the other hand, T-DNA inserts were more evenly distributed along the chromosomes, being more frequently recovered in pericentromeric regions and less prone to hot spots for integration than Tos17.

Table 2 Distribution of newly-transposed Tos 17 inserts from the OTL insertion line library on the 12 rice chromosomes compared to those of Tos17 and T-DNA inserts in the NIAS and OTL libraries, respectively
Table 3 Densities of insertion in the 12 rice chromosomes of the newly-transposed Tos17 elements in the OTL library compared to those of Tos17 and T-DNA inserts in the NIAS and OTL libraries, respectively
Fig. 4
figure 4

Density graphs of (Up to Down tracks) NIAS Tos17 insertions, OTL Tos17 insertions, OTL T-DNA insertions and of FL-cDNA along chromosome 1. Position of the centromere is shown by a red square

Preference for insertion in genes and gene regions

The GC contents of Tos17 and T-DNA FSTs were similar averaging 41.4% and 40.6% respectively, slightly lower than the average GC content of the rice genome (43.6%). The GC range at Tos17 insertion sites was narrower than that of T-DNA insertion sites in the OTL lines, and found to parallel the GC content distribution of Tos17 insertion sites in the NIAS library (Fig. 5A). Tos17 insertion sites were mostly found in genic vs. intergenic regions (84.1%), in non-TE predicted genes (77.5%) and preferably in coding sequences (68.8%), compared to the unique T-DNA insertion sites characterized in the same plant material (53,5% in genic, 49.1% in non-TE predicted genes and 28.3% in coding sequences, respectively) (Table 4). As previously reported, T-DNA inserts were more frequently retrieved in promoter and 3′UTR regions than in coding sequences, strikingly differing in that from Tos17 inserts recovered in the same lines.

Fig. 5
figure 5

(A) Distribution of GC content of 1 kb regions extending 5′ and 3′ from OTL Tos17, NIAS Tos17 and OTL T-DNA insertion points. (B) Distribution of GC content of the 274 Tos17 hot spot genes and of a random set of 274 non hot spot Tos17 target genes

Table 4 Frequencies of insertion of Tos17 inserts from the OTL insertion line collection in the genic and intergenic regions of the genome of Oryza sativa variety Nipponbare

Complementary feature of Tos17 and T-DNA insertions

The 9,060 Tos17 newly-transposed elements inserted into non-TE genes were found to disrupt a total of 2,773 genes with an average of 3.27 inserts per gene, similar to that observed in the NIAS library (3.28 inserts per gene on average) whereas the 4,472 T-DNA elements inserted into non-TE genes in the same analyzed lines disrupted 3,911 genes (1.14 inserts per gene on average). The overlap between genes targeted by Tos17 and T-DNA in the OTL library appear to be limited, since genes interrupted by both mutagens in the library represented only 14.9% and 10.6% of the sets of genes interrupted by Tos17 and T-DNA inserts, respectively. On the other hand, a large fraction (48%) of the genes tagged by Tos17 inserts in the OTL T-DNA library were also found to be tagged by Tos17 in the NIAS collection (Fig. 6). This indicates that characterizing Tos17 inserts in a T-DNA rice collection allows a rapid increase in genome coverage through a highly-complementary tagging of additional genes. On the other hand, the chance of finding Tos17 mutant alleles in a gene already tagged by a T-DNA insert in the same genetic background may prove to be lower, as long as the overall genome coverage by both categories of inserts is limited. Genes tagged by Tos17 were classified through gene ontology (GO) and their distribution in different classes compared to that of the T-DNA inserts, the NIAS Tos17 inserts and the whole classified, non-TE rice gene complement. We did not detect any striking bias for a given broad GO category which would be specific to Tos17. However, some categories tended to be consistently over- and under-represented among Tos17 tagged genes when compared to T-DNA tagged ones in both the OTL and NIAS libraries (Supplementary Table 1). As Tos17 was reported to often integrate into defence response genes, due to their GC content falling into the target site narrow GC range and their frequent clustered organization along the chromosomes (Miyao et al. 2003), we investigated whether this class of genes was also over-represented in the genes interrupted by Tos17 inserts in the OTL library. Though representing 4.75% of the non-TE gene complement successfully classified through GO (n = 18,584) and 4.15% of the genes interrupted by T-DNA inserts in the OTL library, the defence response genes indeed represented 10.1% and 9.7% of the genes interrupted by Tos17 in the OTL and NIAS libraries respectively.

Fig. 6
figure 6

Venn’s diagram of rice genes tagged by OTL Tos17, NIAS Tos17 and OTL T-DNA inserts. 210 genes contain at least one insert in each of the three insert populations

Comparison of hot spots for integration in the OTL and NIAS libraries

Over the 2,773 genes interrupted by Tos17 inserts in the OTL T-DNA library, 444 genes were categorized as “hot spots” as they exhibited a range of 5 to 121 integrations of newly-transposed Tos17 elements in the analysed regenerated lines (Fig. 6). Interestingly, a majority of 274 (61.7%) of these 444 genes were also found to exhibit more than 5 inserts (precisely from 5 to 106) in the NIAS Tos17 library (Supplementary Table 2), while only 23 “hot spot genes” were not represented at all in the NIAS Tos17 FST database. On the other hand, over the 2,329 genes exhibiting a range of 1–4 Tos17 inserts in the OTL T-DNA collection, only 189 (8.1%) exhibited more than 5 integrations in the NIAS Tos17 library, whereas 1095 (47%) were not found in the NIAS FST collection. These findings indicates that the “hot spot genes” are generally found in both libraries and represent integration sites highly accessible to the Tos17 integration even in different culture and regeneration conditions. Native Tos17 copies residing on chromosomes 7 and 10 proved to be themselves hot spot for Tos17 integration since, applying the same filter than for non-TE genes, we observed 10 and 21 insertions respectively in the OTL library (11 and 25 respectively in the NIAS library).

High frequency insertions of Tos17, represented in both libraries, appear to occur in genes exhibiting a lower and narrower GC range than that of a random panel of non hot spot genes (42.8 ± 4.4 vs. 45.2 ± 6.6) (Fig. 5B). Positioning the 274 hot spot genes, represented in both OTL and NIAS library, on the rice genome, tends to accentuate the trends observed in the overall distribution of insertion sites notably the preference for subtelomeric regions of the chromosomes and low accessibility of short arms of chromosomes 4, 9 and 10 found to be heterochromatic regions (Fig. 7). Comparison of the density of hot spots genes compared to the density of the overall Tos17 insertion sites (Supplementary Table 3) allows to distinguish three classes of chromosomes where i. Both hot spot and insertion site densities are consistently either higher (Chromosomes 1, 2 and 3 and to a lesser extent Chromosome 6) or lower (Chromosomes 10, 11 and 12) than the genome average, following in that the density of FL-cDNA (see Table 2) ii. The density of hot spot genes is lower than that of the genome average whereas the insertion site density matches that of the genome average (Chromosome 5) iii. The density of hot spot genes is higher than that of the genome average whereas the insertion site density is lower than that of the genome average (Chromosome 4 -with all the hot spots located on the long arm- and Chromosome 7).

Fig. 7
figure 7

Distribution of Tos17 hot spot genes, found in both the OTL and NIAS libraries of insertion lines, on the rice chromosomes (i.e. exhibiting at least 5 insertions in each of the library). The number of cumulated insertions (bars) ranges from 10 to 227 but has been voluntarily framed to a 60 insertion threshold for better resolution

We successfully categorized through Gene Ontology 201 genes among the 274 which were prone to hot spots for Tos17 integration in both the OTL and NIAS libraries. Though the overall distribution of hot spot genes did not differ significantly from that of the 18,584 non-TE rice genes classified according to the same procedure (Supplementary Table 1), a detailed analysis of these genes showed a strong representation of specific gene families such as genes with NB-ARC domain (494 genes in rice, mainly disease response related such as the NBS LRR) or encoding LRR family protein. A closer examination of the distribution of hot spot genes in relation with their function shows that genes sharing similar function and organized in clusters can be preferred targets of high frequency insertion of Tos17. Genes belonging to receptor kinase family, LRR family, NB-ARC, D-mannose binding lectin family, Protein phosphatase, aminotransferase Terpene synthase family, Ubiquitin carboxyl-terminal hydrolase, B3 DNA binding domain, TPR domain containing protein notably fall in this category (Supplementary Table 4).

To determine whether genes expressed at a high level in callus tissue are more prone to high frequency integration of Tos17 than other genes, we made a bulk query with the 274 Tos17 hot spot genes, a random panel of 274 Tos17 non hot spot genes and of random panel of 750 non TE rice genes against the NCA (callus) MPSS library. Percentages of genes exhibiting at least one signature in the NCA library for the 3 sets of genes were of 11.6, 33.4 and 20.1 in the 3 respective sets (Supplementary Table 5). Whereas Tos17 target genes are more frequently expressed in callus than random genes, their levels of expression in callus based on the occurrence of signatures does not seem higher than those of random genes. Tos17 hot spot genes did not exhibit either higher frequency or higher level of expression in callus than random Tos17 target genes.

Discussion

In a first attempt to complement the characterization of Tos17 inserts existing in a T-DNA insertion library of rice, we developed a protocol aiming at efficiently and selectively recovering and sequencing the regions flanking the newly-transposed Tos17 elements, whose transposition was specifically induced by the tissue culture steps of the T-DNA transformation procedure. Our strategy took advantage of the absence of EcoRV restriction sites in the regions flanking the 3′ end of the two Tos17 copies resident in the genome of the rice variety Nipponbare. The selective amplification protocol was based on the use of EcoRV digests of genomic DNA followed by ligation of adapters and two rounds of PCR using nested Tos17 specific primers. The scaling up of this protocol proved to be highly efficient and it allowed the processing of 25,286 DNA samples in a short time span with limited manpower and led to the anchoring of 14,513 Tos17 inserts to the rice pseudomolecules. However, though satisfactory, this number yet represents only 17.3% of the total of 83,686 new inserts estimated in the library according to Southern blot data. To complement the coverage of the Tos17 inserts, further effort should therefore be dedicated to widen the range of enzymes used for digesting DNA in the walk PCR protocol—as shown in this study, but with the likely additional burden of picking the PCR fragments and running a third PCR- or the use of a different protocol such as TAIL-PCR (Miyao et al. 2003).

The average number of 3.37 new Tos17 inserts found in the OTL T-DNA collection appears lower and therefore more tractable for insert purification than the average of 10 inserts found in plants regenerated from cell aggregate suspension in the NIAS library. This number is equivalent to the average 4 inserts per line estimated in T-DNA lines of another japonica cultivar, Dong Jin, which has been used to generate the POSTECH T-DNA insertional mutant library (G.An, unpublished). On the other hand, the japonica cultivar Tainung 67, used to generate the TRIM T-DNA insertion library (Hsing et al. 2007) proved to contain poorly active Tos17 copy (ies) with an average of 0.2 new insert per T-DNA plant.

Some (19.5 %) of the 14,513 Tos17 FSTs were found to be redundant with at least one other sequence in the population. We showed that this redundancy can be mainly explained by a shared cell lineage of some lines from which the FSTs were recovered. In our transformation protocol, an average of 4, mostly independent T-DNA lines are indeed generally derived from a single, co-cultivated ENU in cv. Nipponbare (Sallaud et al. 2004; Sallaud et al. 2003). Though the characterization of Tos17 FSTs is far from exhaustive in the library, this indicates that a fraction of Tos17 insertions occurred at an early step of callus induction from the seed embryo scutellum, e.g. during divisions of the somatic proembryo which give rise to the ENU (Bec et al. 1998), while other insertions may have occurred later e.g. during the growth of hygromycin-resistant cell lines. Whether these successive insertions cause functional alterations, this finding is reminiscent of the phenomenon of sequential accumulation of somaclonal mutations reported during callus growth in cv. Nipponbare (Fukui 1983). The fact that the common cell lineage in T-DNA plants can be tracked back in our library may also readily provide valuable information on whether the phenotypic alterations observed during forward screenings are likely due to the T-DNA or have more likely to be related to a somaclonal variation source, including Tos17 (J.B. Morel, Pers. Com.). Extent of redundancy of FSTs due to a shared cell lineage has been estimated to be of 35% in the NIAS Tos17 insertion line library (Miyao et al. 2003).

Distributions of Tos17 inserts among and along the chromosomes in the OTL T-DNA collection were found to remarkably parallel those observed in the NIAS library. Hot spots for integration were notably found to lie in the same sequence intervals and often in the same genes. This shows that though two independent and different procedures were used to generate the libraries in cv. Nipponbare, the overall behaviour and target site specificity of the retroelement deduced from their respective FST populations was highly conserved. It would be nevertheless interesting to investigate whether these features are conserved in another cultivar harbouring an active copy, which might be different from that of chromosome 7, the active copy in cv. Nipponbare. An early concern was also that a bias might have been introduced by the flanking region recovery method itself, which notably involved the use of a relatively rare cutter unique restriction enzyme, EcoRV, in our study. The consistency of the results of distribution, target site specificity and preferred genes of Tos17 in the OTL library compared to those of Miyao and collaborators, who used a different protocol of recovery of flanking regions based on TAIL-PCR, proves that the findings can be considered robust and not biased.

Overall, we showed that the genes tagged by Tos17 inserts in the OTL population contain an average of 3.27 inserts per gene, consistent with what is observed in the NIAS library, whereas genes tagged by T-DNA contain an average of 1.14 insert. This means that an effort of insert characterization aiming at covering the whole gene complement has to be higher using Tos17 than the T-DNA. Genome saturation may prove not tractable with Tos17 alone not associated with another mutagen. On the other hand, Tos17 is more frequently inserted into coding sequences, thereby often creating putative KOs, than the T-DNA, which is more frequently recovered in promoter and 3′UTR regions. A consequence of the tendency of Tos17 to insert in preferred genes should also result in a higher occurrence of allelic series composed of Tos17 inserts rather than of both Tos17 and T-DNA inserts in the OTL library. On the other hand, the complementary gene regions preferred by T-DNA and Tos17 elements may result in a wider variety of changes in expression in a given gene tagged by both insertional mutagens (e.g; aside from KOs, creation of dominant negative mutants or mutants with changes in gene expression patterns).

We found many disease response and signal perception and transduction related genes which are known to be often organized in clusters along the chromosomes, are among the Tos17 preferred target genes, and that this organization in tandem series creates hot spot regions along the chromosomes. A striking example are the NB-ARC domain containing genes, with 14 representatives among the 274 very hot spots of the genome (5.1%) though they represent only 1.16 % of the non TE genes (494:42,653). Results presented here confirmed that insertion sites highly accessible to Tos17 exhibit a narrow GC content range. Taken together these results are consistent with the findings of Miyao and collaborators (Miyao et al. 2003), who attributed this preference to the overall clustering of these genes and the overlap of their GC content with the narrow GC content distribution of Tos17 target sites. Another hypothesis for explaining high frequency integration in certain categories of genes is that these are preferentially expressed in dedifferentiated cells during tissue culture and that Tos17 benefits from these windows of open chromatin structure to insert specifically in these genes. We showed that indeed Tos17 target genes exhibit a 2 to 3 fold higher expression frequency in callus tissues than random rice genes. However, Tos17 preferred genes do not appear to have a significantly higher level of expression in callus tissues than random genes. On the other hand, frequency and level of expression of genes hot spots for integration in callus did not differ from those of genes prone to single Tos17 insertions suggesting that high frequency insertion might be more related to specific GC content or DNA structure than to particular expression in callus tissues.

The large allelic series existing in genes which are Tos17 preferred targets (see Supplementary Table 2) also offers, when the phenotypic information of the corresponding lines is available, a straightforward means for establishing in silico links between gene disruption and phenotypes and provide a clue on gene function. Phenotypes of these lines are accessible at http://urgi.versailles.inra.fr/OryzaTagLine.

In conclusion, this first characterization of Tos17 inserts in the T-DNA library of rice using the TOSTRAP protocol proved very cost effective since it rapidly increased the range of novel genes tagged, not only because of the still low coverage of the genome by the inserts in the population but also because of an insertion preference in different types of genes than those tagged by the T-DNA. This indicates that advantage in characterizing Tos17 inserts in a T-DNA collection lies more in a highly complementary tagging of additional genes than in finding alleles in a same genetic background in genes already tagged by the T-DNA. This greatly enhances the library genome coverage and its overall value for implementing forward and reverse genetics strategies targeting specific traits.