Introduction

Powdery mildew fungi are some of the most damaging plant pathogens. They affect a wide range of dicotyledonous and monocotyledonous host species and cause significant economic losses in crop plants worldwide (Glawe 2008). Powdery mildews belong to the family Erysiphaceae in the order Erysiphales (Ascomycota) (Inuma et al. 2007). Their interactions with the host are characterized by the establishment of structures called haustoria inside epidermal plant cells, allowing the pathogen to maintain a parasitic relationship and to take up nutrients from the host. This results in a complete dependence of powdery mildew growth on living plant cells (Glawe 2008).

The fungal pathogen Blumeria graminis is an ascomycete species subdivided into seven formae speciales (ff. spp.), each highly specialized for a different host species (Inuma et al. 2007). B. graminis f.sp. tritici (hereafter called Bg tritici) is the causal agent of powdery mildew on wheat (Triticum aestivum L.). Little is known about the biology of this fungus. Methods and resources are therefore needed to identify genes that promote virulence and determine the Bg tritici–wheat interaction, and to understand the mechanisms underlying host specialization of B. graminis.

Recently, the genome sequence of B. graminis f.sp. hordei (Bg hordei), which is closely related to Bg tritici and is the causal agent of powdery mildew of barley (Hordeum vulgare), was completed (Spanu et al. 2010). This work, together with reports of other obligate biotroph genome sequences (Baxter et al. 2010; Duplessis et al. 2011), revealed genomic hallmarks possibly driven by adaptation to the obligate biotrophic lifestyle. These include a massive proliferation of transposable elements correlated with expansion of the genome size, and the loss of genes that are not essential for the biotrophic lifestyle, such as genes encoding enzymes devoted to plant cell wall degradation or to nitrate and sulfur assimilation (Spanu et al. 2010; Baxter et al. 2010; Duplessis et al. 2011).

To determine its genomic features, we initiated the exploration of the Bg tritici genome by constructing and characterizing the first bacterial artificial chromosome (BAC) library from this fungus. We fingerprinted the library and produced a physical map of the genome, which allowed a first estimate of the genome size. Based on low-pass 454 sequencing of the genome and 20,001 BAC-end sequences (BES) representing approximately 7% of the nuclear genome, we were able to build a Blumeria repeat database and to obtain a first insight into the Bg tritici genome.

Materials and methods

Plant and fungal material

The construction of the BAC library and 454 sequencing were performed using DNA from Bg tritici isolate 96224 (Brunner et al. 2010). Cultures of 96224 were propagated by infecting fresh leaf segments of the susceptible bread wheat cultivar Kanzler, kept on agar supplemented with benzimidazole at a concentration of 30 mg/L.

BAC library

Construction and characterization of the BAC library are described in the Supplementary Text.

Assembly of a physical contig map of Bg tritici

Fingerprinting was performed at the Istituto di Genomica Applicata (http://www.appliedgenomics.org). High information content fingerprints (HICF) were produced and processed with the FPB software (Scalabrin et al. 2009) for fingerprint background removal and with the GenoProfiler software (You et al. 2007) for removal of contaminants and batch processing of fingerprints into size files that can be input into FPC (Soderlund et al. 1997). Fingerprinted clones were initially assembled using FPC at a Sulston cutoff score of 1e−60 (initial incremental contig build), and Q-clones were split using three DQ steps at slightly lower Sulston scores. Singleton clones were then added to contigs, and ends were merged (when applicable) by increasing the cutoff score by 1e−5 in a stepwise manner to 1e−20 (final cutoff). The approach used to verify the accuracy of the FPC assembly experimentally is described in the Supplementary Text.
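The stepwise relaxation of the cutoff can be illustrated with a short sketch. It assumes that "increasing the cutoff score by 1e−5" means raising the exponent of the Sulston cutoff by 5 at each merging round, which is one plausible reading and is not stated explicitly here. The function name is hypothetical and the code does not call FPC itself.

```python
def cutoff_schedule(start_exp=-60, final_exp=-20, step=5):
    """Return the exponents of the Sulston cutoffs used in successive rounds.

    Assumption: each round raises the exponent by `step`,
    i.e. 1e-60 -> 1e-55 -> ... -> 1e-20.
    """
    if step <= 0 or final_exp < start_exp:
        raise ValueError("expected a positive step and final_exp >= start_exp")
    exps = list(range(start_exp, final_exp + 1, step))
    # Make sure the final cutoff is always part of the schedule.
    if exps[-1] != final_exp:
        exps.append(final_exp)
    return exps


schedule = cutoff_schedule()
print(", ".join(f"1e{e}" for e in schedule))
```

Under this reading, the initial build at 1e−60 is followed by eight progressively less stringent rounds before the final cutoff of 1e−20 is reached.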

BAC-end sequencing

BAC-end sequencing was performed at the Arizona Genomics Institute, University of Arizona (www.genome.arizona.edu), and each clone was sequenced from both ends. Trace files were processed with the Phred program for base calling, and sequences were trimmed using a quality score threshold of 20 (Ewing et al. 1998). Vector sequences were masked using CROSS_MATCH (www.genome.washington.edu) and removed from the analysis. Only reads with a length of at least 100 bp were retained, providing 20,001 high-quality BAC-end sequences.
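The two filtering criteria (quality score of 20, minimal length of 100 bp) can be sketched as follows. This is a simplified illustration, not Phred's own trimming algorithm: it assumes that low-quality bases are simply stripped from both ends of a read, and the function names are hypothetical.

```python
def trim_ends(seq, quals, min_q=20):
    """Strip bases with quality below min_q from both ends of a read.

    Simplified end trimming (an assumption); internal low-quality
    bases are left untouched.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def keep_read(seq, min_len=100):
    """Apply the 100 bp minimal length criterion to a trimmed read."""
    return len(seq) >= min_len


# Hypothetical 120 bp read with five poor bases at each end.
read = "ACGT" * 30
quals = [10] * 5 + [30] * 110 + [10] * 5
trimmed, _ = trim_ends(read, quals)
print(len(trimmed), keep_read(trimmed))
```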

Construction of the Blumeria repeat database

The low-pass genome sequencing of the Bg tritici isolate 96224 was performed using the GS FLX platform (Roche) (Supplementary Text). Reads were assembled using the MIRA software with default settings for assembly of 454 sequences. Contigs with a 10–25× coverage and a minimal length of 7 kb were used for the manual characterization of full-length transposable element (TE) sequences.
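The selection of contigs for manual TE characterization can be sketched as a simple filter on coverage and length. The record type, the function name and the example contigs are hypothetical; only the thresholds (10–25× coverage, at least 7 kb) come from the text, and the bounds are assumed to be inclusive.

```python
from typing import NamedTuple


class Contig(NamedTuple):
    name: str
    length_bp: int
    coverage: float  # mean read depth of the assembled contig


def select_repeat_candidates(contigs, min_cov=10.0, max_cov=25.0, min_len=7000):
    """Keep contigs whose depth and length match the stated criteria."""
    return [
        c for c in contigs
        if min_cov <= c.coverage <= max_cov and c.length_bp >= min_len
    ]


# Made-up contigs for illustration only.
example = [
    Contig("c1", 8200, 14.5),   # passes both criteria
    Contig("c2", 9500, 1.2),    # coverage too low (likely low-copy)
    Contig("c3", 3100, 18.0),   # too short
    Contig("c4", 12000, 40.0),  # coverage above the window
]
print([c.name for c in select_repeat_candidates(example)])
```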

TEs were identified as follows. BLASTN and BLASTX searches (Altschul et al. 1997) against specialized databases such as RepBase (www.girinst.org) and TREP (wheat.pw.usda.gov/ITMI/Repeats/) were performed to reveal typical features characterizing the different superfamilies of TEs. Long interspersed nuclear elements (LINE) were identified by their generally well-conserved ORF2 sequence. The presence of associated ORF1 and poly-A sequences allowed further identification of complete elements. Short interspersed nuclear elements (SINE) were identified by the presence of internal A and B promoter boxes necessary for RNA polymerase III binding as well as a poly-A tail at the 3′ end. For long terminal repeat (LTR) retrotransposons, typical patterns of the terminal repeats were revealed using DOTTER (Sonnhammer and Durbin 1995). Target site duplications and LTR borders were determined manually. Elements were classified into the copia or gypsy superfamilies according to the similarity of their ORF-encoded proteins to entries of the PTREP database and according to their internal organization (Wicker et al. 2007). Additionally, we used contigs of the Bg hordei draft genome (version June 2007), which were made available to us by the BluGen consortium (www.blugen.org), for homology searches to identify the Bg hordei homologs of Bg tritici repeats.

In order to reduce redundancy within the different families, we set a threshold of 80% similarity at the nucleotide level for the definition of a family. Finally, elements were named according to the nomenclature of Wicker et al. (2007).

BES analysis

The 20,001 BES were first analyzed for their repeat content through BLASTN and BLASTX searches (Altschul et al. 1997) against the Blumeria repeat database. Only hits with a minimal alignment length of 100 bp, 80% nucleotide identity (for BLASTN) and an E value <10e−10 (for BLASTX) were considered. For the identification of additional high-copy sequences, sequences matching the repeat database were removed, and the remaining ones were searched against themselves using the same BLASTN parameters.
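The hit filtering described above can be sketched as follows. The sketch assumes that BLAST results are available in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E value, bit score); the output format actually used is not stated here. The E value threshold is passed as a parameter because its exact value is an assumption, and only the E value criterion is applied to BLASTX hits since their alignment length is reported in residues. Function names are hypothetical.

```python
def parse_hit(line):
    """Parse one line of 12-column tabular BLAST output (assumed format)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected 12 tab-separated fields")
    return {
        "query": f[0],
        "subject": f[1],
        "pident": float(f[2]),
        "length": int(f[3]),
        "evalue": float(f[10]),
    }


def passes_blastn(hit, min_len=100, min_ident=80.0):
    """Nucleotide criterion: alignment >= 100 bp at >= 80% identity."""
    return hit["length"] >= min_len and hit["pident"] >= min_ident


def passes_blastx(hit, max_evalue=1e-10):
    """Protein criterion: E value below the chosen threshold (assumed 1e-10)."""
    return hit["evalue"] < max_evalue


# Made-up hit line for illustration only.
line = "bes_0001\tBgt_RSX_Yhi\t92.5\t150\t5\t0\t1\t150\t10\t159\t1e-50\t200"
hit = parse_hit(line)
print(passes_blastn(hit), passes_blastx(hit))
```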

Access to sequence data

All BAC-end sequences can be accessed through accession numbers FR776010 to FR796010 in the EMBL nucleotide sequence database. An FTP server (address available on request) provides access to the complete set of sequences of the 56 identified Bg repeats (files Bg_repeats_fasta and Bg_repeats_hypothetical_proteins_fasta).

Results

Fingerprinting of the Bg tritici BAC library provides a physical map of the genome and an estimate of the minimal genome size

A large-insert BAC library was constructed from the Bg tritici reference isolate 96224 (Supplementary Text). Fingerprinting of the complete library (12,288 clones) generated 6,831 HICF, which were assembled into 266 BAC contigs (Table 1). Only 146 (2.1%) BAC clones remained as singletons. The largest contig is 5.8 Mb, and 50% of the assembly is contained in contigs larger than 1 Mb. By comparison with experimentally tested overlaps of BAC clones at two genomic regions (Supplementary Fig. 3 and Supplementary Table 1), we could confirm the accuracy of the fingerprint assembly and its relevance for establishing contigs spanning large genomic regions. The total length of the assembly is 174 Mb, giving a first estimate of the Bg tritici minimal genome size.
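The assembly statistics quoted above can be reproduced with a few lines. The statement that 50% of the assembly lies in contigs larger than 1 Mb is treated here as an N50-type statistic, which is an interpretation rather than a term used in the text. The function name and the toy contig lengths are hypothetical; the singleton fraction uses the reported counts.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the total."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Fraction of fingerprinted clones left as singletons (146 of 6,831).
singleton_pct = 100 * 146 / 6831
print(f"singletons: {singleton_pct:.1f}%")

# Toy example with made-up contig lengths (arbitrary units).
print(n50([5, 4, 3, 2, 1]))
```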

Table 1 Characteristics of the Bg tritici contig assembly

Construction of a Blumeria repeat database

To study the fraction of repetitive DNA in the Bg tritici genome, we established a Blumeria repeat database, exploiting two datasets of sequence information. First, whole genome sequencing of Bg tritici was carried out with one full 454 GS FLX run. This resulted in 491,163 reads with an average size of 226 bp. Assembly of these reads produced 39,363 contigs. Contigs with a very high coverage were selected, as high coverage in a low-pass dataset indicates sequences corresponding to high-copy repeats. Additionally, we exploited a few contigs belonging to the first Bg hordei draft genome sequence (version June 2007), which were made available to us by the BluGen consortium (www.blugen.org).
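Why high coverage points to repeats in this dataset can be seen from the expected sequencing depth. The sketch below divides the total number of sequenced bases by the minimal genome size of 174 Mb reported in this study; since the true genome may be larger, the resulting depth should be read as an upper estimate. The function name is hypothetical.

```python
def mean_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Expected average depth: total sequenced bases / genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


# Read count and mean read length from the 454 run; 174 Mb minimal genome size.
depth = mean_coverage(491_163, 226, 174_000_000)
print(f"expected mean depth: {depth:.2f}x")
print(f"a contig at 10x depth is enriched about {10 / depth:.0f}-fold")
```

With an expected depth well below 1×, single-copy regions are mostly covered by no more than a read or two, so contigs assembled at 10× or more are likely to represent sequences present in many copies.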

The composition of the Blumeria repeat database is presented in Table 2. We identified 20 families of LINEs and two Bg tritici SINEs, Bgt_RSX_Yhi and Bgt_RSX_Lie, which are homologs of the previously characterized Bg hordei SINE elements EGH-24-1 (Rasmussen et al. 1993) and EG-R1 (Wei et al. 1996), respectively. A total of 27 LTR retrotransposons were found (Table 2), of which 13 families could be classified as members of the gypsy superfamily and nine as members of the copia superfamily. Five sequences showed characteristics of solo LTRs, but the complete retrotransposons from which they originated could not be characterized. Finally, seven sequences exhibited characteristics of TEs and a high copy number, but could not be assigned to any repeat order (“unclassified” in Table 2). Among them were two Bg tritici sequences for which we could identify two homologous sequences in Bg hordei (both Bg tritici and Bg hordei homologs are in the database).

Table 2 Transposable element families of the Blumeria repeat database and representation of the superfamilies in the BES dataset

In conclusion, our Blumeria repeat database is composed of 56 TE families, including some elements which are conserved in Bg tritici and Bg hordei (Table 2).

BAC-end sequencing and TE content analysis

All 12,288 BAC clones of the library were sequenced from both ends. After the individual sequencing reads were trimmed for length (threshold of 100 bp) and low-quality bases, vector and bacterial contaminant sequences were eliminated. In the end, the Bg tritici BAC-end database consisted of 20,001 sequences with an average read length of 633 bp (Supplementary Fig. 4). The total BES length is 12,662,922 bp with an average GC content of 44.3%. This large dataset of representative, random sequences was subsequently used to analyze the composition of the Bg tritici genome.
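The summary statistics of the BES dataset can be computed as in the following sketch. The function names are hypothetical, and the choice to ignore ambiguous bases (such as N) when computing GC content is an assumption; the way GC content was actually calculated is not described here.

```python
def gc_content(seqs):
    """Percent G+C over all unambiguous bases (A, C, G, T) in the sequences."""
    gc = at = 0
    for s in seqs:
        s = s.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / (gc + at)


def mean_length(total_bp, n_seqs):
    """Average sequence length in bp."""
    return total_bp / n_seqs


# Reported totals: 12,662,922 bp in 20,001 BES.
print(f"mean BES length: {mean_length(12_662_922, 20_001):.0f} bp")

# Toy sequences for illustration only.
print(gc_content(["GGCC", "AATT"]))
```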

Sequences corresponding to TEs were first identified in the 20,001 BES by BLASTN search against our Blumeria repeat database. The cumulative length of sequences with homology to the 56 repeat families represented 24.1% of the BES database (Supplementary Fig. 5), suggesting that the characterized repeat families could contribute approximately one fourth of the genome. The ten most abundant elements represented half of the TE fraction (49.8%) and accounted for around 12% of the genome (Supplementary Fig. 5). Five LINE elements together represented 6.2% of the genome. The single most abundant element was the SINE Bgt_RSX_Yhi (2%).

We then masked the sequences matching the Blumeria repeat database at the nucleotide level and searched the remaining sequences against the Blumeria repeat database a second time, at the protein level, in order to evaluate the representation of TE superfamilies. Sequences with a cumulative length representing 23.7% of the BES set gave hits. Taken together with the previous analysis, the fraction of the BES set matching TEs of the Blumeria repeat database is 47.8%, i.e., 6.04 Mb. The analysis of these sequences revealed the predominance of non-LTR retrotransposons over LTR retrotransposons, mainly due to LINE elements (Table 2).

To identify additional unknown repeats, we masked all the sequences that had matched our repeat database at the nucleotide or protein level and retained only those BES whose remaining unmasked sequence was longer than 50 bp. This resulted in 13,270 remaining BES, which were searched against themselves by BLASTN. Repeats or high-copy sequences were defined as sequences with at least two copies in the 13,270 BES set. Considering that the complete BES database represents 7.2% of the Bg tritici minimal genome size, a high-copy sequence according to our definition would be expected to occur in more than 28 copies in the genome. This search revealed 8,880 high-copy BES with a total length of 4.74 Mb. Together with the 6.04 Mb matching the Blumeria repeat database, we estimate the total repeat content in the BES database, and by extension in the Bg tritici genome, to be 85%.
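The repeat accounting in this section can be retraced with the reported figures. The sketch below only recombines numbers given in the text; the constant and function names are hypothetical. Small differences from the published values (for example 6.05 Mb computed from the percentages versus the reported 6.04 Mb) are expected from rounding of the inputs.

```python
BES_TOTAL_BP = 12_662_922   # total length of the 20,001 BES
DB_NT_PCT = 24.1            # BES fraction matching the database (BLASTN)
DB_PROT_PCT = 23.7          # additional fraction matching at the protein level
DB_MATCH_BP = 6.04e6        # reported length matching the repeat database
DE_NOVO_BP = 4.74e6         # reported length of additional high-copy BES
SAMPLED_FRACTION = 0.072    # share of the minimal genome covered by the BES


def repeat_summary():
    """Recombine the reported values into the derived estimates."""
    db_pct = round(DB_NT_PCT + DB_PROT_PCT, 1)
    db_bp_from_pct = db_pct / 100 * BES_TOTAL_BP
    total_pct = 100 * (DB_MATCH_BP + DE_NOVO_BP) / BES_TOTAL_BP
    # Two copies observed in a 7.2% sample scale to this many genome copies.
    genome_copies = 2 / SAMPLED_FRACTION
    return db_pct, db_bp_from_pct, total_pct, genome_copies


db_pct, db_bp, total_pct, copies = repeat_summary()
print(f"database matches: {db_pct}% (~{db_bp / 1e6:.2f} Mb)")
print(f"total repeat content: {total_pct:.1f}%")
print(f"expected genome copies for a 2-copy BES repeat: {copies:.1f}")
```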

Discussion

In this paper, we report on the construction and characterization of the first B. graminis f.sp. tritici large insert BAC library. The majority of BAC libraries constructed from fungal or oomycete pathogens have a relatively small average insert size between 40 and 80 kb, and those constructed from the barley powdery mildew Bg hordei were reported to have average insert sizes of 30 and 41 kb (Ridout and Brown 1999; Pedersen et al. 2002). The Bg tritici BAC library consists of 12,288 clones of 115 kb on average with 87% of the inserts larger than 100 kb. This result is remarkable for DNA obtained from a true obligate biotrophic fungus which cannot be cultivated in vitro, and is comparable with the largest libraries reported for ascomycete or oomycete species (Zhu et al. 1997; Zhang et al. 2006; Chang et al. 2007). With a 7.5× coverage of the genome, our BAC library thus represents a powerful tool for the exploration of the Bg tritici genome.

Taking advantage of this library, we could show that Bg tritici possesses an expanded genome of at least 174 Mb, much larger than what is commonly observed for fungal genomes (Gregory et al. 2007). This observation is in accordance with the recently reported genome size of the closely related barley powdery mildew pathogen Bg hordei, which is estimated to be 120 Mb (Spanu et al. 2010), and indicates that the formae speciales of B. graminis have an atypically large genome size. The high percentage of repeats in Bg tritici (85%) appears to explain the unusually large size of its genome, which is possibly also true for the genome of Bg hordei, as hypothesized by Spanu et al. (2010). We observed that non-LTR retrotransposons in the form of LINEs predominate over LTR retrotransposons in the Bg tritici genome. SINEs are also surprisingly abundant in Bg tritici and could represent at least 3% of the genome, although they are relatively small in size (Wicker et al. 2007). Similarly, Spanu et al. (2010) observed that LINEs and SINEs largely predominate over LTR retrotransposons. This picture differs from what was recently reported for other repeat-rich oomycete and fungal genomes such as Hyaloperonospora arabidopsidis (Baxter et al. 2010), Melampsora larici-populina and Puccinia graminis f.sp. tritici (Duplessis et al. 2011). In Bg hordei as well as in H. arabidopsidis, only a small fraction of class II transposable elements was detected (Spanu et al. 2010; Baxter et al. 2010), which is not the case for M. larici-populina and P. graminis f.sp. tritici, where the proportions of class I and class II elements are more balanced (Duplessis et al. 2011).

Because repeats were detected with very stringent parameters (80% identity), the copies identified are highly similar to one another, which could suggest that proliferation of repetitive DNA in Bg tritici is the consequence of a high rate of recent transposon activity. Recently, Oberhaensli et al. (2011) sequenced and annotated three Bg tritici BAC clones. They found a large difference in TE content in a comparative analysis with Bg hordei, indicating that most of the TE activity in the two genomes indeed occurred after divergence of the two formae speciales, around 10 million years ago. In the same study, TEs were found to account for 48.8% and 51.4% of the contig length at the two loci analyzed. However, those clones were specifically screened to encompass gene-containing regions. At a third locus, TEs were shown to occupy up to 69% of the sequence (F. Parlange, unpublished results), which is closer to the estimate presented in the current study. This suggests that repeated elements may not be equally distributed along the genome, and underscores the importance of generating large and randomly dispersed sets of sequences to draw an accurate picture of the composition of large and highly repetitive genomes.

The reports on genome sequences from three powdery mildew species, Bg hordei, Erysiphe pisi, and Golovinomyces orontii (Spanu et al. 2010), and from the “downy mildew” H. arabidopsidis (Baxter et al. 2010) highlighted striking signatures of convergent evolution to an obligate biotrophic lifestyle, in particular an unusually expanded genome size correlated with a proliferation of transposable elements. Recently, the same observation was reported for two other obligate biotrophic parasites, the rust fungi M. larici-populina and P. graminis f.sp. tritici (Duplessis et al. 2011). These observations in different evolutionary lineages support the hypothesis of Spanu et al. (2010) that large genome size and high repetitive DNA content are common hallmarks associated with obligate biotrophy. Transposable elements affect the genome through their ability to move and replicate. They can generate high levels of genetic variation independently of sexual recombination, and could contribute to the genome flexibility responsible for rapid adaptation of populations to selection imposed by resistance genes in the case of phytopathogenic fungi, or to environmental constraints in the case of symbionts. The genomes of the basidiomycete fungus Laccaria bicolor and the ascomycete Tuber melanosporum, which form ectomycorrhizal symbioses with their host plants, were also reported to be large (65 and 125 Mb, respectively), with a high proportion of repeats (21% and 58%, respectively; Martin et al. 2008, 2010).

A convergent biotrophic adaptation was also observed at the genetic level, with a common reduction of genes that are not essential for the biotrophic lifestyle, such as genes encoding enzymes involved in primary and secondary metabolism (Spanu et al. 2010), enzymes devoted to plant cell wall degradation (Spanu et al. 2010; Baxter et al. 2010; Duplessis et al. 2011) and transporters (Spanu et al. 2010). The absence of genes involved in the inorganic nitrate and sulfur assimilation pathways also seems to be a feature of obligate biotrophic genomes (Spanu et al. 2010; Baxter et al. 2010; Duplessis et al. 2011). However, little is known so far about the molecular mechanisms involved in the establishment of the interaction between obligate biotrophic fungi and their hosts. Investigating these aspects is the major challenge in the study of this class of pathogens.

The future sequencing and annotation of the complete Bg tritici genome are the next steps in the exploration of this genome. Sequencing can now be considered through next generation sequencing technologies (Nowrousian et al. 2010), and the physical map and BES generated in this study should greatly facilitate assembly of the genome. The updated Blumeria repeat database will also help to overcome difficulties related to the massive presence of TEs and simplify the identification of gene coding sequences. This should provide the opportunity for comparative studies with the other recently sequenced powdery mildew genomes or, at a broader scale, with obligate biotrophic genomes, and contribute to the understanding of the molecular features determining the pathogenesis of those parasites.