Background

Oomycetes form a phylogenetically distinct group of eukaryotic microorganisms that includes plant and animal pathogens causing widespread damage with high economic [1-3] and ecological impact [4]. Pathogenic oomycete species are found mainly in three orders, the Pythiales, the Peronosporales and the Saprolegniales [5]. Recent studies of phylogenetic relationships within oomycetes suggest that the ability to infect plants appeared at least twice in the oomycete lineage: first in an ancient lineage that evolved into the Pythiales (including Phytophthora and Pythium) and Peronosporales, and second in the Saprolegniales lineage [6], which includes destructive animal pathogens such as the fish pathogens Saprolegnia parasitica and Aphanomyces piscicida, and plant pathogens such as A. euteiches and A. cochlioides. Among oomycetes, Phytophthora is the best-studied genus, and genomic resources (cDNA libraries and/or complete genome sequences) are available for several species [7-11]. In contrast, few nucleic acid sequences are available from species classified in the Saprolegniales, the main data set being a collection of 1,510 ESTs obtained from the fish pathogen S. parasitica [12].

Recently, we developed a pathosystem consisting of the model legume Medicago truncatula and the legume pathogen A. euteiches [13]. To date, no chemical compound protects efficiently against A. euteiches, and infested fields cannot be used for legume production for many years [14]. To gain better knowledge of the A. euteiches-legume interaction and to identify molecular targets for drug design, a genomic approach was used. Two unidirectional cDNA libraries were prepared from A. euteiches strain ATCC201684 [15]: one from mycelium grown in a liquid medium containing yeast extract and glucose (MYC library), and one from mycelium in contact with M. truncatula root tissues (INT library). The latter condition is a simplified model of growth during pathogenesis. A total of 18,684 expressed sequence tags (9,224 from the MYC library and 9,460 from the INT library) were submitted to the EBI databank for accession number assignment and were assembled into 7,977 unigenes. Here we present AphanoDB, a database in which all these data are structured. Users can retrieve information using text searches or BLAST analyses. AphanoDB is currently the most extensive resource of Aphanomyces sequences and related annotations.

Construction and content

Preparation of cDNA libraries and sequencing

The MYC library (saprophytic library) was made with total RNA isolated from mycelium after 5, 7 and 9 days of growth in liquid YG medium (2.5% yeast extract, 5% glucose, w/v) at 23°C in the dark. Mycelia were frozen in liquid nitrogen and total RNA was extracted. RNAs from each sample were mixed in equal amounts. mRNAs were purified using an Oligotex mRNA purification kit (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions. A unidirectional library was constructed in the pSport1 plasmid using the Superscript Plasmid System for cDNA Synthesis and Cloning (Invitrogen, USA). Plasmid ligations were transferred by electroporation into E. coli DH5α cells following the manufacturer's protocol (Invitrogen Life Technologies, CA, USA). A library of approximately 5 × 10^5 colony-forming units was obtained.

The INT library (interaction library) was made with mRNAs purified from mycelium grown for 1 or 2 days on M. truncatula roots. Roots of two-week-old plants were laid onto a 2-day-old mycelium, and culture was continued for 1 or 2 days. Before harvesting the mycelium, plants were gently removed from the Petri dish to avoid any contamination of A. euteiches RNA with plant RNA. A. euteiches mRNA extracts were pooled and cDNA was generated as described for the MYC library. A library of approximately 5 × 10^5 colony-forming units was obtained. About 10,000 clones from each library were sequenced from the 5' end using the T7 primer on ABI3730xl DNA sequencers.

EST quality and assembling

The two libraries were processed together. Base calling of all reads was performed with the Phred program [16]. Only reads with a Phred value over 20 on at least 80% of the sequence and with a length of over 100 bp were selected. MultiFASTA sequence and quality files were generated and cleaned by vector and adaptor trimming using crossmatch (Figure 1). After these steps, 18,684 high-quality sequences were obtained, corresponding to 93% of the initial sequence set. These sequences were assembled using the CAP3 program [17]. The minimum overlap length was set to 100 bp, the minimum percent identity to 97% and the maximum gap length to 30 bp (-o 100 -p 97 -f 30). ESTs from the two libraries were assembled into 7,977 unigenes comprising 2,843 contigs and 5,134 singletons (Table 1). Consensus sequences were renamed with a unique identifier made up of the prefix 'Ae' for the species, 3 digits for the number of ESTs included in the contig, the 2 letters 'AL' for the strain and 5 digits as an individual number.
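The read-selection rule above can be illustrated with a minimal sketch. It assumes that "a Phred value over 20 on 80% of the sequence" means that at least 80% of the bases of a read have a quality score strictly above 20; the function name and its parameters are hypothetical and are not taken from the AphanoDB pipeline.

```python
def passes_quality(quals, min_q=20, min_fraction=0.8, min_length=100):
    """Return True if a read meets the selection thresholds.

    quals: per-base Phred quality scores of one read (list of ints).
    Assumption: a read is kept when it is longer than min_length bases
    and at least min_fraction of its bases score strictly above min_q.
    """
    if len(quals) <= min_length:
        return False
    good_bases = sum(1 for q in quals if q > min_q)
    return good_bases / len(quals) >= min_fraction


# Usage: keep only the reads that satisfy both criteria.
reads = {
    "read_a": [30] * 150,              # long and high quality
    "read_b": [30] * 90,               # too short
    "read_c": [30] * 100 + [10] * 50,  # only about 67% of bases above Q20
}
kept = sorted(name for name, quals in reads.items() if passes_quality(quals))
```

Trimming of vector and adaptor sequences, which the pipeline performs after this step, is not covered by the sketch.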

Figure 1 AphanoDB pipeline flow chart.

Table 1 Statistics on AphanoDB ESTs status.

EST analysis and functional annotation

Putative functions were assigned by running the BLASTX algorithm (release 2.2.14) [18] against a local copy of the NCBI non-redundant (nr) protein database (5-17-2007 version). Of the sequences, 61% showed similarity to a protein sequence with an E-value ≤ 1e-10. To estimate the level of contamination of A. euteiches ESTs by M. truncatula cDNA sequences, which might occur in the INT library, the unigene sequences were compared to 270,000 M. truncatula ESTs deposited in the GenBank database using the BLASTN algorithm. Only 11 unigenes showed high similarity (E-value < 1e-100) to M. truncatula ESTs, and only two of these were composed of ESTs from the INT library. Moreover, these two sequences displayed a higher similarity to Phytophthora sequences than to M. truncatula sequences. From this analysis, it can be concluded that contamination of the INT library with M. truncatula cDNAs is very low, if present at all.

Sequences were also compared to proteome data from seven fully sequenced organisms using the BLASTX algorithm, since parts of the proteomes of these organisms are not present in the nr database. To facilitate comparative analyses between oomycete species, A. euteiches sequences were compared to the P. sojae and P. ramorum proteomes [8]. Since oomycetes have been shown to be phylogenetically related to diatoms, A. euteiches sequences were compared to the Thalassiosira pseudonana proteome [19]. To find genes putatively involved in plant pathogenesis, A. euteiches sequences were compared to the proteome of the fungal pathogen Nectria haematococca. The Toxoplasma gondii [20] and Plasmodium falciparum [21] proteomes were selected because it has recently been suggested that apicomplexan parasites and oomycetes share common infection strategies [22]. Finally, the proteome of Arabidopsis thaliana was also included in the analysis.
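As an illustration of the E-value filtering applied to the BLASTX results, the following sketch keeps the best hit per unigene at or below a threshold. It assumes the hits are available in the standard 12-column tab-delimited BLAST layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score); the function name and the sample identifiers are hypothetical.

```python
def best_hits(tabular_text, max_evalue=1e-10):
    """Return {query: (subject, evalue)} with the lowest E-value per query.

    Assumption: each non-comment line has 12 tab-separated columns and the
    E-value (11th column) is written in a form that float() can parse.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue  # hit is too weak to support an annotation
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Usage with three made-up hits for two made-up unigenes.
sample = "\n".join([
    "unigene_1\tprot_A\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "unigene_1\tprot_B\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t230",
    "unigene_2\tprot_C\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.001\t35",
])
hits = best_hits(sample)
```

Raising or lowering `max_evalue` gives the other thresholds mentioned in the text, for example 1e-100 for the contamination check.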

Domain searches using the InterProScan program (release 4.2) [23] were performed locally with HMM searches against the Pfam protein database (16.0) [24]. Of the sequences, 45% showed a known Pfam domain with an E-value ≤ 1e-5 (Table 1).

To classify the sequences according to the Gene Ontology classification scheme [25], Pfam domains with InterPro accession number were linked to GO molecular function, biological process and cellular component terms using the interpro2go file [26]. Finally, 42% of the sequences were assigned to a GO molecular function term, 43% to a biological process and 24% to a cellular component (Table 2).
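The mapping step can be sketched as follows. The sketch assumes that each mapping line of the interpro2go file has the form `InterPro:IPR000003 Retinoid X receptor/HR78 > GO:DNA binding ; GO:0003677` and that comment lines start with `!`; the function names and the unigene identifier are hypothetical.

```python
from collections import defaultdict


def parse_interpro2go(lines):
    """Return {InterPro accession: set of GO identifiers}."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        left, sep, right = line.partition(" > ")
        if not sep or ";" not in right:
            continue  # not a mapping line in the assumed layout
        accession = left.split()[0].split(":", 1)[1]
        go_id = right.rsplit(";", 1)[1].strip()
        mapping[accession].add(go_id)
    return mapping


def go_terms_for_unigenes(domain_hits, mapping):
    """Transfer GO identifiers from InterPro domains to unigenes."""
    return {
        unigene: set().union(*(mapping.get(acc, set()) for acc in accessions))
        for unigene, accessions in domain_hits.items()
    }


# Usage with two example mapping lines and one made-up unigene.
example_lines = [
    "!comment line",
    "InterPro:IPR000003 Retinoid X receptor/HR78 > GO:DNA binding ; GO:0003677",
    "InterPro:IPR000003 Retinoid X receptor/HR78 > GO:nucleus ; GO:0005634",
]
ipr2go = parse_interpro2go(example_lines)
annotated = go_terms_for_unigenes({"unigene_1": ["IPR000003"]}, ipr2go)
```

Sorting the resulting identifiers into molecular function, biological process and cellular component requires the GO term table itself, which is not covered here.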

Table 2 Classification of unigenes based on gene ontology (GO) mappings. Mappings of the InterPro domains to terms in the GO hierarchy were used to assign GO terms to the unigenes. Sequences for which a protein domain was predicted with an E value < 1e-5 were selected for this analysis.

The Gene Ontology database in SQL format was downloaded from the GO web site [27] and added to AphanoDB. Links between GO terms and Enzyme Nomenclature terms [28] were established using the ec2go mapping [29], and KEGG metabolic pathway map links [30] were also added to the database.

Repeat sequences were detected using RepeatMasker [31] with default parameters in order to identify microsatellite markers, and the outputs were added to the database.

To estimate differences in gene expression levels between the saprophytic and the pathogenic growth conditions, a statistical analysis was performed based on EST frequency in each library. We used the test described by Susko et al. [32] to calculate p-values and applied false discovery rate (FDR) control for multiple-test correction [33], with the false discovery rate controlled below α = 0.05.
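The multiple-test correction can be illustrated with a minimal sketch of the Benjamini and Hochberg step-up procedure. The test of Susko et al. is not reproduced here: the p-values are treated as given inputs, and the function name and example values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


# Usage: four made-up p-values, one per unigene, in input order.
significant = benjamini_hochberg([0.039, 0.001, 0.216, 0.008])
```

With these example values the thresholds for ranks 1 to 4 are 0.0125, 0.025, 0.0375 and 0.05, so only the two smallest p-values (0.001 and 0.008) are declared significant.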

Database implementation

AphanoDB is a MySQL database containing features, cleaned ESTs and deduced contigs. At the end of each step of the AphanoDB pipeline, information was stored in the related table and linked to the core tables containing EST and consensus sequences. This dynamic structure enables the addition of new sequences from different oomycete species, together with their associated features. The database is available through an Apache web server running on Fedora Core 6 Linux. The web interface is based on PHP, JavaScript and HTML; it enables dynamic MySQL queries through a user-friendly graphical interface.
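The idea of core sequence tables linked to per-step annotation tables can be sketched as below. This is not the AphanoDB schema: the table names, column names and inserted rows are hypothetical, and SQLite (from the Python standard library) is used in place of MySQL so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two core tables and one feature table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE unigene (
            unigene_id TEXT PRIMARY KEY,
            consensus  TEXT NOT NULL
        );
        CREATE TABLE est (
            est_id     TEXT PRIMARY KEY,
            library    TEXT NOT NULL,          -- 'MYC' or 'INT'
            unigene_id TEXT REFERENCES unigene(unigene_id)
        );
        CREATE TABLE blast_hit (               -- one table per pipeline step
            unigene_id TEXT REFERENCES unigene(unigene_id),
            subject    TEXT,
            evalue     REAL
        );
    """)
    conn.execute("INSERT INTO unigene VALUES ('unigene_1', 'ACGT')")
    conn.executemany(
        "INSERT INTO est VALUES (?, ?, ?)",
        [("est_1", "MYC", "unigene_1"),
         ("est_2", "INT", "unigene_1"),
         ("est_3", "INT", "unigene_1")],
    )
    conn.execute("INSERT INTO blast_hit VALUES ('unigene_1', 'prot_A', 1e-40)")
    return conn


def est_counts_by_library(conn, unigene_id):
    """Count the ESTs of one unigene in each library."""
    rows = conn.execute(
        "SELECT library, COUNT(*) FROM est WHERE unigene_id = ? GROUP BY library",
        (unigene_id,),
    )
    return dict(rows.fetchall())


conn = build_demo_db()
counts = est_counts_by_library(conn, "unigene_1")
```

Because every feature table refers to the core tables only through the unigene identifier, a new annotation step or a new species can be added as extra rows or extra tables without altering the existing ones.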

Utility and discussion

User interface

AphanoDB provides a complete summary sheet containing all annotation results for each predicted unigene (Figure 2). The BLASTX results against the nr database, the HMM search results against the Pfam database and the Gene Ontology assignments are displayed on this summary sheet. BLASTX alignment details are also provided and will be updated yearly. The summary sheet also contains the sequence of the unigene, links to the EST alignment and to the ACE file, and a link to a multiFASTA file of the contig components.

Figure 2 Web interface. A. Hierarchical browsing of ontologies, including the distribution of the genes in the subcategories. B. Domain description query output with E-value cut-off. The first output clusters the genes by InterPro domain prediction. C. Part of a gene report sheet.

Users can query the database for a given sequence by providing its accession number, EST name or gene ID. Searches on the Gene Ontology annotation are possible, with graphical bars representing GO subfamilies (Figure 2A). For the GO catalytic activity subfamilies, EC links and KEGG metabolic pathway map links are provided. Functions can be searched by InterPro or Pfam domain name or accession number. BLASTX results stored in the database can be queried by organism name or putative function (Figure 2B–C). Queries on repeat sequence types and sizes are also available, as are queries on expression profile in a specific growth condition. For expression profiles, the output shows only significant results with a p-value lower than the Benjamini and Hochberg cut-off [33]. Users can submit their own sequences for BLAST searches against the A. euteiches sequences using BLASTN, TBLASTN and TBLASTX. The BLAST output page displays one summary sheet for each hit on the A. euteiches sequences.

Utilities and extensions

AphanoDB provides molecular data about A. euteiches transcripts. The database contains 18,684 high quality sequences and allows direct applications for functional and comparative genomic approaches. AphanoDB is constructed in such a way that it can absorb a large number of additional sequences from other oomycete species, such as S. parasitica, for which large scale cDNA sequencing projects are under way.

Conclusion

AphanoDB represents a major contribution to genomic studies on oomycetes and related organisms such as diatoms and brown algae. The addition of new sequences from other Aphanomyces species and other Saprolegniales is planned in the near future. AphanoDB will facilitate gene prediction and annotation for the future whole-genome sequencing of Saprolegniales species. AphanoDB contains cleaned, assembled and annotated ESTs which will serve the oomycete research community. The database provides comprehensive tools for comparative approaches that might lead, for example, to the identification of pathogenicity factors.

Availability and requirements

AphanoDB is freely available to academic and non-academic users at http://www.polebio.scsv.ups-tlse.fr/aphano/. A browser supporting frames must be used (Firefox, Netscape 2.0, Internet Explorer 3.0 or higher). cDNA clones can be obtained upon request at http://cnrgv.toulouse.inra.fr/ENG/. All ESTs can be downloaded from AphanoDB. They have also been deposited in dbEST (accession numbers CU357053 to CU361296).