Introduction

To investigate the transcriptome of the plant-parasitic nematode Radopholus similis (the burrowing nematode), we explored newly generated and existing expressed sequence tags (ESTs). Generating ESTs is a cost-effective way to obtain large amounts of sequence data, and is widely used as a first step in obtaining molecular data for a species. However, these sequences are typically of relatively low quality, as numerous biases can be introduced along the whole process. Artifacts introduced in the first step, the construction of a complementary DNA (cDNA) library, include a low occurrence of full-length transcripts, exclusion of very short and very long transcripts, generation of chimeric constructs, and the inclusion of contaminating DNA or rRNA sequences (Nagaraj et al. 2007a). Furthermore, estimates of internal priming lie around 2–3% (Aaronson et al. 1996). The sequencing step starts with a random selection of clones from the cDNA library, followed by one-time sequencing of the inserts. The resulting EST sequences are contaminated with fragments of the vector and/or adaptors, and can contain up to 3% erroneous bases (Nishikawa and Nagai 1996). Contaminating sequences are usually removed before submission or additional analyses. Nowadays, a plethora of bioinformatic tools for EST analysis exists (Nagaraj et al. 2007b). Combined with a thorough understanding of how EST sequences are generated, these tools can deliver valuable information that serves different purposes: obtaining a first impression of the molecular composition of a species (McCarter et al. 2003), identifying (tissue-, developmental stage- or organism-specific) genes (Chen et al. 2006; Dubreuil et al. 2007), estimating the level of gene expression (digital northern) (Liu and Graber 2006; Munoz et al. 2004), annotating genome sequences (Blumenthal et al. 2002), and facilitating proteome analysis (Liu et al. 2006).

Radopholus similis is an obligate migratory plant-parasitic nematode (PPN), mainly occurring in subtropical regions. It parasitizes the roots of over 365 host plants, of which banana, plantain and citrus are economically the most important (O’Bannon 1977; Sarah et al. 1996). The nematode remains mobile throughout its whole life cycle (hence the adjective “migratory”), and every mobile stage can infect new roots, except the males, which make up about 5% of a normal population. With the help of secreted proteins, originating from large pharyngeal gland cells and secreted through a hard, hollow, spear-like structure in the head (the stylet), the nematode penetrates the root and digests the cortex cells. This results in large necrotic lesions in which secondary infections rapidly take place, mainly by Fusarium oxysporum and Rhizoctonia solani. The result of this nematode infection is stunting and wilting of the host plant or, in severe cases, even toppling due to the weakened stem base (known as the “blackhead-toppling disease” in bananas). These effects can cause massive losses in crop production, ranging from 5 to 75% (O’Bannon 1977; Price 2006; Sarah et al. 1996). Once R. similis is established in the field, it is very hard, if not nearly impossible, to eradicate. Chemical control is extremely hazardous for the environment and most “nematicides” (e.g. methyl bromide) are banned for this reason (United Nations Environment Programme 1995). Since only very few resistant varieties are available so far, new sources of resistance are being sought (Elsen et al. 2004; Stoffelen et al. 2000; Wuyts et al. 2007). Recently, transgenic banana expressing cystatin was shown to possess some level of resistance to R. similis infection (Atkinson et al. 2004). To support the ongoing research on this devastating plant-parasitic nematode, we present the generation and analysis of EST sequences from mixed stages of R. similis to gain a first insight into the transcriptome.

Materials and methods

Laboratory experiments

Radopholus similis was cultured at 25°C on carrot disks in small, Parafilm-sealed Petri dishes (Jacob et al. 2007). Approximately 5,000 nematodes of mixed stages were collected in sterile demineralized water. After grinding these nematodes in liquid nitrogen, RNA was extracted with TRIzol® Reagent (Invitrogen, Carlsbad, USA), precipitated with isopropanol and washed with 70% ethanol. The pellet was redissolved in diethylpyrocarbonate (DEPC)-treated demineralized water. Integrity of the RNA was checked by electrophoresis on a 0.5× TAE 1% agarose gel. Concentration was determined with the ND-1000 spectrophotometer (Nanodrop, Wilmington, DE, USA). This RNA served as the basis for cDNA library construction using the SMART™ cDNA Library Construction Kit, following the manufacturer’s instructions (Clontech, Mountain View, USA). The resulting fragments were directionally cloned in the pDNR-Lib vector provided with the kit. The R. similis mixed stage cDNA library contained over 10^5 primary transformants. Clones were sequenced using the M13 forward or reverse primer at the Genome Sequencing Center (GSC, Washington University, St Louis, USA). Sequences and quality files can be found on Nematode.net (Wylie et al. 2004), and sequences were submitted to the EST division of GenBank (dbEST, Boguski et al. 1993).

Cleaning and clustering

The sequences were cleaned using Seqclean (http://www.tigr.org) with a locally downloaded vector database and default parameter settings, to remove vector, poly(A) and short (<100 nt) sequences. Next, the dataset was clustered using the TIGR Gene Indices Clustering Tool (TGICL) (Pertea et al. 2003), and assembled sequences were constructed by CAP3 (Huang and Madan 1999) using default settings. This generated contigs (clustered ESTs) and singletons (non-clustered ESTs), together commonly referred to as “unigenes” (ESTs used for clustering can be found in additional material file A5). Based on the clustering results, ESTstat was used to estimate the degree of fragmentation (Wang et al. 2006a). To gather ESTs with poly(A) sequences (and thus containing 3′ untranslated regions), cleaning by Seqclean was repeated without poly(A) screening (option -A). The sequences that were cleaned differently in this second run than in the first contain predicted poly(A) sequences.

BLAST-searches

The basic local alignment search tool (BLAST) analyses (Altschul et al. 1990) were performed both locally and via netblast. The BLASTx-results were parsed by an in-house Perl script: for each hit, the species and phylogenetic classification were obtained from its GenBank file and used for subsequent classification of the unigene query (as nematode-, animal-, eukaryote-specific, etc.). R. similis unigenes were also used for BLASTx against C. elegans sequences (E value cut-off of 1e-05), and the top-hit sequences were used to estimate the degree of fragmentation (Mitreva et al. 2004). For this estimation, 384 of 1,632 unigenes with C. elegans hits share the same top-hit with one or more other unigenes. Of these unigenes, 221 were “redundant” as the C. elegans top-hit was already detected by one or more other unigenes, from which the fragmentation can be estimated. Further, all available nematode EST sequences were downloaded (June 2007) and searched locally with tBLASTx for homology to the unigenes of Radopholus similis. For those tBLASTx hits, the developmental stage and nematode species used for the cDNA library construction were parsed from their GenBank files. A tBLASTx-search was performed with the R. similis unigenes as query against the coding mitochondrial sequences of all nematode species available in GenBank (September 2007). To address the nature of cluster 1, coding and non-coding classification was done for this cluster by ESTScan (Iseli et al. 1999) and RNAz (Washietl et al. 2005).
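The counting step behind the top-hit based fragmentation estimate can be sketched as follows. This is a minimal illustration in Python, not the in-house Perl script: the function name, the dictionary layout and the toy identifiers are assumptions made for the example. It only counts unigenes that share a C. elegans top-hit and how many of them are redundant (the 384 and 221 of the text); the conversion of these counts into a fragmentation percentage follows Mitreva et al. (2004) and is not reproduced here.

```python
from collections import Counter

def count_shared_and_redundant(top_hits):
    """Count unigenes sharing a top-hit and the redundant ones among them.

    top_hits maps a unigene ID to the ID of its best C. elegans hit
    (hypothetical layout, for illustration only).
    """
    hit_counts = Counter(top_hits.values())
    # Unigenes whose top-hit is also the top-hit of at least one other unigene
    shared = [u for u, hit in top_hits.items() if hit_counts[hit] > 1]
    # Each shared top-hit is "detected" once; every further unigene is redundant
    distinct_hits = {top_hits[u] for u in shared}
    redundant = len(shared) - len(distinct_hits)
    return len(shared), redundant

# Toy data: geneA is hit by two unigenes, geneC by three, geneB by one
toy = {
    "CL1": "geneA", "CL2": "geneA",
    "CL3": "geneB",
    "S1": "geneC", "S2": "geneC", "S3": "geneC",
}
n_shared, n_redundant = count_shared_and_redundant(toy)
```

In the toy data, five unigenes share a top-hit with another unigene and three of them are redundant.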

Translation

The FrameD gene prediction and translation program (Schiex et al. 2003) was trained with manually selected full-length coding open reading frames (ORFs) from the set of unigenes, based on the BLASTx-results and Clustal W alignment with the corresponding most homologous sequences (Thompson et al. 1994) (ORFs of clusters 4, 5, 8, 11, 12, 13, 15, 18, 23, 25, 28, 31, 35, 41, 42 and 43). This set was extended with 7 full-length coding sequences of R. similis already in GenBank (accession numbers AM691117.1, AM691118.1, EU190885, EF693940, EF693941, EF693942, EF693943), resulting in a total of 15,069 coding nucleotides. Using these sequences, an N-resistant Markov model was built on the website of FrameD (http://bioinfo.genopole-toulouse.prd.fr/apps/FrameD/FDM.pl). Predictions were analyzed for the occurrence and position of the coding part of the unigene. Unigenes that were predicted to be non-coding by FrameD, but had homologous proteins in other species according to BLASTx, were translated using Prot4EST trained with H. glycines sequences (Wasmuth and Blaxter 2004). Signal peptide prediction on this set of translations was done by SignalP 3.0 (Emanuelsson et al. 2007), and a signal peptide was only assigned if both the neural network and the hidden Markov model predicted a signal peptide. The signal peptide was cleaved from the translated sequence and subsequent transmembrane domain prediction was performed by TMHMM (http://www.cbs.dtu.dk/services/TMHMM/) and SOSUI (Hirokawa et al. 1998). A protein was assigned to the cell membrane if both programs predicted a transmembrane domain.
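The two consensus rules described above (a signal peptide only if both SignalP models agree, a membrane assignment only if both TMHMM and SOSUI agree) can be written down as a small decision function. The sketch below is illustrative only: the function and argument names are hypothetical, the four booleans are assumed to have been parsed from the program outputs beforehand, and treating "signal peptide without membrane assignment" as "secreted" is our reading of how the secreted set was derived.

```python
def classify_translation(sp_neural_net, sp_hmm, tm_tmhmm, tm_sosui):
    """Apply the consensus rules to one translated unigene.

    All four arguments are booleans (hypothetical, pre-parsed predictions):
    sp_* from the two SignalP models, tm_* from TMHMM and SOSUI run on the
    sequence after signal peptide cleavage.
    """
    has_signal_peptide = sp_neural_net and sp_hmm   # both models must agree
    in_membrane = tm_tmhmm and tm_sosui             # both programs must agree
    if has_signal_peptide and not in_membrane:
        return "secreted"
    if in_membrane:
        return "membrane"
    return "other"

# A signal peptide supported by only one SignalP model is not assigned
example = classify_translation(True, False, False, False)
```

Requiring agreement between two predictors lowers the number of false positives at the cost of missing some genuine secreted proteins.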

Trans-spliced leader search

Searches were based on 70 spliced leader (SL) sequences reported in Guiliano and Blaxter (2006). In a first approach, a BLASTn search was set up with the SL sequences as query, a minimum matching length of 20, and a cut-off E value of 0.1. Using these parameters, a search was done in the R. similis unigene dataset. Since SLs have been shown to occur in the plant-parasitic nematode Meloidogyne incognita, a locally downloaded set of EST sequences of this nematode was used as a control (Guiliano and Blaxter 2006; McCarter et al. 2003). To minimize technical influences, we selected an M. incognita EST set of a similar size to ours that was also generated by the SMART technology. The chosen control EST set has 3,098 ESTs sequenced from a SMART cDNA library, constructed from females of M. incognita (library “Meloidogyne incognita female SMART pGEM”). Since the BLASTn search yielded no results for the R. similis unigenes, a second approach was applied using a Perl regular expression pattern, based on common features extracted from the SL sequences and the 5′ position of the SL in the unigene. The resulting pattern /^[AGCT]{0,30}GGT[^CG]{4,9}CCC[^C]\w{5,9}AG/ was used to search the SL sequences set, the R. similis unigenes (both strands) and the M. incognita ESTs (both strands). As a confirmation of the SL sequences found, the M. incognita sequences containing a trans-spliced leader sequence were used as a query for a tBLASTx-search (E value cut-off of 1e-35), simultaneously against C. elegans EST sequences (from dbEST) and against R. similis unigenes containing full-length coding sequences or at least part of a 5′ UTR sequence (based on the translation prediction, see “Translation”, and on the presence of a small piece (GGCCGGG) of the 5′ SMART primer). When highly similar ESTs (ranging from 60 to 93% identity on the protein level) were found in all datasets, the corresponding ESTs were aligned (on the DNA level) using Clustal W.
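The pattern search on both strands can be illustrated with a short Python sketch. The regular expression is copied from the text; everything else (function names, the choice of Python instead of Perl, and the use of the canonical C. elegans SL1 sequence as a test case) is an assumption made for this example and does not reproduce the original script or the 70-sequence reference set.

```python
import re

# Pattern from the text: up to 30 nt of 5' sequence, then the SL-like motif
SL_PATTERN = re.compile(r"^[AGCT]{0,30}GGT[^CG]{4,9}CCC[^C]\w{5,9}AG")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def has_sl_motif(seq):
    """True if the pattern matches at the 5' end of either strand."""
    seq = seq.upper()
    return bool(SL_PATTERN.match(seq)
                or SL_PATTERN.match(reverse_complement(seq)))

# Canonical C. elegans SL1, used here only as an illustrative positive case
SL1 = "GGTTTAATTACCCAAGTTTGAG"
forward_hit = has_sl_motif("ACGT" + SL1 + "ATGAAACCC")
reverse_hit = has_sl_motif(reverse_complement(SL1 + "ATGAAACCC"))
no_hit = has_sl_motif("ATGCATGC" * 5)
```

Because the pattern is anchored at the start of the sequence and allows at most 30 leading nucleotides, it only reports motifs near the 5′ end of a read, which is where a trans-spliced leader is expected.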

Gene ontology and KEGG biochemical pathway annotation

To map and annotate gene ontology (GO) terms, BLAST2GO was used (Conesa et al. 2005), with default parameters, except for an E value cut-off of 1e-05, maximum number of 30 BLAST hits, E value hit filter for annotation of 1e-05, the conversion of the annotation to GOSlim view, and a node scoring filter in the GO graph of 50 for biological process, 20 for molecular function and 20 for cellular component. Further, KOBAS was used to annotate KEGG biochemical pathways to the unigenes (Mao et al. 2005).

Annotating RNAi data to the unigenes

Using the RNAi data available for numerous C. elegans genes, we tried to assign an RNAi phenotype to the R. similis unigenes. A BLASTx-search identified the top-hit C. elegans sequence for each unigene (using an E value cut-off of 1e-05). Subsequently, the RNAi phenotype and GO terms (only of the C. elegans top-hits with observed RNAi phenotypes) were retrieved via WormMart (Schwarz et al. 2006), and the GO terms were analyzed and visualized with WEGO (Ye et al. 2006).

Results

Dataset characteristics

A total of 5,853 new EST sequences were generated, with a slightly higher average sequence length than the ESTs of R. similis already deposited in the dbEST division of GenBank (Table 1). Analytical processing of both sets combined (removal of vector sequences, poly(A) tails and sequences <100 nt) resulted in 6,800 ESTs, and subsequent clustering (merging overlapping sequences together into “contigs”) established a 13% increase in sequence length. This final set of unigenes contains 1,008 contigs, grouped into 989 clusters (enclosing sequences with minor sequence variations), and 2,659 “singletons” (non-overlapping EST sequences). With growing cluster size (i.e. the number of ESTs contained in a cluster), the number of clusters decreases logarithmically (Fig. 1). A certain degree of “fragmentation” (also called underclustering) could be expected in our final dataset and was estimated by ESTstat to be as high as 15.8% (Wang et al. 2006a). Another method, described by Mitreva et al. (2004), resulted in a comparable estimate of 12.9%. Due to this fragmentation error, our dataset represents at most 3,194 genes, which is approximately 16% of the total gene number, assuming a gene number similar to that of Caenorhabditis (Stein et al. 2003).
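The arithmetic behind the upper bound on the number of genes can be checked with a few lines of Python. The variable and function names are ours, and the observation that the 3,194 figure corresponds to the 12.9% estimate is inferred from the numbers rather than stated in the text.

```python
contigs = 1008
singletons = 2659
unigenes = contigs + singletons  # total unigene count used throughout the Results

def max_genes(n_unigenes, fragmentation):
    """Upper bound on distinct genes after correcting for fragmentation."""
    return round(n_unigenes * (1 - fragmentation))

# Correction with the top-hit based estimate (12.9%)
genes_top_hit = max_genes(unigenes, 0.129)
# Correction with the ESTstat estimate (15.8%) gives a somewhat lower bound
genes_eststat = max_genes(unigenes, 0.158)
```

With 3,667 unigenes, the 12.9% correction gives 3,194 genes and the 15.8% correction gives 3,088.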

Table 1 Dataset characteristics on DNA and protein level
Fig. 1 Graphical representation of the cluster size distribution. Y axis: number of clusters. X axis: cluster size (i.e. the number of ESTs represented by the cluster)

BLASTx analysis

A BLASTx-search against the GenBank non-redundant protein sequences resulted in hits for 2,130 unigenes (58.1% of the total unigene set). Of these, 1,710 (46.6%) had an E value lower than 1e-05 (more significant match), whereas the remaining 420 sequences (11.5%) were only retained with higher E values (between 1e-05 and 1e-01). Ribosomal proteins were among the most abundant top-hits (n = 176 or 4.8% of the unigenes). A total of 535 unigenes (14.6%) matched sequences originating from both eukaryotes and prokaryotes, and 622 unigenes (17.0%) matched solely sequences from all major eukaryotic lineages. The wide occurrence of these unigenes suggests a role in basal cell metabolism. Surprisingly, 14 unigenes gave a plant-specific hit. Since R. similis was cultured on carrot disks (Daucus carota), the presence of contaminating plant tissue can explain these sequences, although the top-hit sequences originated from different plant species (with an E value range between 1e-10 and 1e-01). Of the remaining unigenes with a BLASTx-hit, 428 exclusively matched animal sequences (11.8%), of which 328 (8.9%) were nematode-specific (see Fig. 2). Seven of the nematode-specific unigenes were found to match exclusively sequences of plant-parasitic nematodes, and 8 matched both plant- and animal-parasitic nematode sequences (see Table 2). Special attention was paid to the largest clusters, as they most likely correspond to highly expressed genes in R. similis. The BLASTx-results of the largest clusters are reported in Table 3: commonly known highly expressed genes are found (such as actin, sec-2), but some pioneer sequences are present as well. A considerable subset of our unigenes (n = 1,537; 41.9%) gave no BLASTx-hits, even with an E value cut-off set as high as 1e-01. One striking feature of these unigenes is their shorter average sequence length (354 ± 166 nt) compared to the unigenes with hits (518 ± 185 nt) (P value two sample t test <0.001) (Fig. 2b). For many of these unigenes, the short sequence length is the cause of the missing BLASTx hits, as their E values cannot reach the preset threshold.

Fig. 2 Overview of the EST analysis (colors are visible in the on-line version). a Graph of density lines of the G + C content of different unigene subsets; b Length distributions of the different subsets of unigenes that are represented in c; c Classification of unigenes with BLASTx homology (E value cut-off of 1e-01; blue), with indication of the portion of unigenes with nematode-specific hits (“nema specific”; matches to nematode proteins only), animal-specific hits (matches to animal proteins only), plant-specific hits (“plant sp.”; matches to plant proteins only), eukaryote-specific hits (matches to eukaryote proteins only), prokaryote-specific hits (“prokaryote sp.”; matches to prokaryote proteins only) and unigenes with hits to pro- and eukaryote sequences (“general”). The orange part marked with “tBLASTx” indicates unigenes without BLASTx-hits, but with tBLASTx-hits in nematode ESTs. The gray part marked with “orphans” indicates the subset of unigenes without BLASTx- or tBLASTx-homology. The inner black and white circle corresponds to the coding (black) and non-coding (white) prediction based on FrameD and BLASTx-homology. The small portions of mitochondrial and Wolbachia-derived sequences are indicated in the circle in violet and dark blue, respectively

Table 2 Nematode specific unigenes and potential parasitism genes
Table 3 BLASTx reports of the 15 largest clusters

Homologues in nematode EST sequences

To find homologues in the transcriptional data of other nematodes (ESTs of all nematodes except R. similis), a tBLASTx-search was performed with the unigenes (E value cut-off of 1e-05). The E value cut-off for the tBLASTx-search was set lower than the E value cut-off for the BLASTx-search (i.e. 1e-05, compared to 1e-01), since for the majority of the unigenes a consistently lower (i.e. more significant) top-hit E value was found with tBLASTx than with BLASTx (see additional material figure A1). This tBLASTx-search reported hits to nematode EST sequences for 2,305 unigenes, of which 560 (15.3%) had homologous EST sequences exclusively in plant-parasitic nematodes (PPN), 106 (2.9%) exclusively in animal-parasitic nematodes (APN), and 147 (4.0%) exclusively in parasitic nematodes (both APN and PPN). As seen for the sequences with and without BLASTx-hit, a similar difference in sequence length could still be observed between unigenes with and without tBLASTx-hit: unigenes having homologues in the nematode ESTs are generally longer (408 ± 169 nt) than those without homologous counterparts (344 ± 163 nt). The persistence of this difference points to an important influence of sequence length on finding homologues with BLAST-searches, and argues for a thorough quality check of the cDNA library used. Furthermore, the tBLASTx-search revealed a large portion of unigenes (n = 408 or 11.1%) without BLASTx homology to known proteins, but with homology to EST sequences of other nematodes (Fig. 2). Notably, the majority of the unigenes with homology to PPN EST sequences lack a BLASTx-hit (367 of 560 unigenes, or 65.5%). Despite the efforts to identify unigenes on the basis of homology using BLAST-searches, unigenes without hits (either BLASTx or tBLASTx; the so-called “orphans”) still constitute a large portion (n = 1,128 or 30.8%, see Fig. 2).

Annotation of the unigenes

Annotation of “gene ontology” (GO) terms helps to categorize unigenes based on their putative function. We used the user-friendly BLAST2GO program to explore the R. similis unigene data set (Conesa et al. 2005). This annotation method is based on sequence homology determined by BLAST-searches. For 1,920 unigenes BLAST2GO could not find a homologous sequence, and no mapping could be retrieved for 259 sequences with BLAST homology. Finally, after mapping a total of 5,501 GO terms to the unigenes, 812 sequences (22%) were successfully annotated, with a higher success rate for longer sequences (see Fig. 3). Analyzing the main GO category “biological process” and its child terms revealed that the “embryonic development”, “growth” and “reproduction” GO terms are the most represented, followed by terms involved in basal cell metabolism, based on annotation scores assigned by BLAST2GO (see Fig. 3). This may reflect the high reproductive rate of R. similis and the high percentage of females (and developing eggs) in the population used for cDNA library construction. In the main GO category “molecular function”, the “protein binding” term is most represented (∼38% of the terms), followed by “structural molecule activity” and “RNA binding”. Many unigenes encoding ribosomal proteins are assigned to the “protein binding” term, as are highly expressed genes coding for structural molecules (such as actin) and regulatory molecules (such as transcription factors). Since those unigenes are abundantly present in the dataset, this causes the overrepresentation of the “protein binding” term. Regarding the main GO category “cellular component”, the term “ribosome” is most represented, constituting together with the term “cytosol” almost half of the total terms. Since the nematode secretes a cocktail of proteins into the plant to control the parasitism process, interesting sequences are expected to be found under the term “extracellular region”. However, the GO terms are assigned using homology to known annotated sequences and scarcely any parasitism gene of plant-parasitic nematodes has GO terms assigned to date; consequently, most parasitism genes are likely not annotated. Furthermore, the “extracellular region” term also encompasses abundantly expressed genes coding for proteins secreted by the gut, epidermis and the nervous system. The answer to parasitism gene annotation may come from a more fine-tuned GO classification adapted to parasitism, since many parasitism genes are not easily classified in the present classification (Berriman et al. 2001). In summary, the GO annotation of the unigenes reflects the biology of R. similis and the characteristics of the cDNA library, and is rather unsuited to detecting gene expression correlated with parasitism.

Fig. 3 Summary of the gene ontology annotation (results by BLAST2GO). a Comparison of the length distribution of all unigenes (black line) with successfully GO annotated unigenes (gray surface); b Most represented GO terms (based on annotation score) of the main category “biological process”; c Most represented GO terms of the main category “cellular component”; d Most represented GO terms of the main category “molecular function”

Assigning RNAi phenotypes

Exploring the RNAi phenotypes in the R. similis unigene dataset can lead to potential control strategies based on disrupting gene expression. We made use of the RNAi phenotypic data available for C. elegans. Using BLASTx (E value cut-off of 1e-05), 1,638 unigenes were found to have a homologous C. elegans gene. Of those C. elegans genes, 659 have a detectable RNAi phenotype. Comparing the GO term distribution of genes with RNAi phenotypes to the complete GO term distribution (see “Annotation of the unigenes”) revealed marked changes. Genes involved in “biological regulation” (GO:0065007, main GO category “biological process”) are enriched from 1% in the complete GO annotation to 10% in the GO annotation of genes with RNAi phenotypes, at the cost of general biological processes, such as metabolic (GO:0008152) and cellular process terms (GO:0009987) (from 19 to 10% and 18 to 12%, respectively). Similarly, in the “cellular component” GO terms, an increase is seen in “macromolecular complex” (GO:0032991) (from 7 to 13%) and “organelle part” (GO:0044422) (from 3 to 8%), while the largest decreases are for “organelle” (GO:0043226) (from 22 to 18%) and “extracellular region” (GO:0005576) (from 2 to 0.5%). The GO term distribution of “molecular function” does not show any remarkable difference. Three quarters of the RNAi phenotypes (n = 506, 76.8%) report a lethal effect. Compared to the total gene set with RNAi phenotypes, the genes with lethal RNAi phenotypes are significantly enriched in the GO terms “macromolecular complex” (GO:0032991), “developmental process” (GO:0032502), “growth” (GO:0040007) and “multicellular organismal process” (GO:0032501) (see Fig. 4). On the other hand, some GO terms, such as “enzyme regulator” (GO:0030234), “molecular transducer” (GO:0060089), and “cell surface receptor linked signal transduction” (GO:0007166), are depleted in the unigenes with lethal RNAi phenotypes. Disrupting the expression of these genes is therefore expected to have a less profound influence on nematode survival, whereas the strongest effects are expected when targeting genes involved in developmental processes.

Fig. 4 GO terms of the genes with predicted RNAi phenotypes. Black bars represent the total set; gray bars represent the subset with predicted lethal RNAi phenotypes. Only the GO categories (up to level 5) which differ significantly between the two sets (Pearson Chi square test P value <0.05) are represented (analysis and output by WEGO)

Translation of unigenes

It was reported before that coding DNA sequences of R. similis are GC-rich. This is mainly the result of the high mean G + C content of the nucleotides at the third position of each codon (GC3%), which has previously been estimated at around 63% (Cutter et al. 2006; Haegeman et al. 2008). This feature poses a potential problem for translation of the DNA sequences using standard sequence translation rules, because every stop codon starts with a uracil (U) (UAG, UGA and UAA). A high GC3% means that most of the nucleotides at the last position of the codons are G or C. Hence the reverse complementary codons mostly start with G or C, reducing the likelihood of encountering a stop codon on the opposite strand. Translation based on the longest open reading frame (ORF) will therefore frequently result in translating the wrong strand. The translation prediction program FrameD (which uses a Markov model) is especially built for dealing with a high GC3%: it predicts coding regions and corrects frame-shifts to obtain reliable translations, based on a set of training sequences (Schiex et al. 2003). We trained FrameD with 15 kb of coding nucleotides from manually collected full-length ORFs based on BLASTx-results and cloned R. similis genes submitted to GenBank. In a performance test, FrameD classified all 226 unigenes containing 3′ untranslated regions (selected by the presence of a poly(A) tail and polyadenylation signal) as non-coding, indicating a low false positive error. In return, the false negative error is rather high, since FrameD classified only 77% of the unigenes with BLASTx-hits (1,317 of 1,710) as coding. To correct for this error, the unigenes with at least one BLASTx-hit, but lacking a FrameD translation, were translated with the Prot4EST translation pipeline (Wasmuth and Blaxter 2004). On the total unigene set, FrameD classified 2,245 (61.2%) unigenes as coding, with detection and correction of frame shifts in 6.4% of the translations. Of the remaining unigenes, 552 sequences had at least one BLASTx hit, and were subsequently translated by Prot4EST. In this way, the total number of protein coding unigenes reaches 2,797, or 76.3% of the unigene dataset (Table 1). Calculation of the G + C content of the coding part of the unigenes resulted in an overall GC-percentage of 53.7%, a GC1% of 55.5%, a GC2% of 40.7% and a GC3% of 64.8% (see additional material figure A2). Searching the translations for signal peptides for secretion revealed that 216 of the 2,755 translations (7.8%) were predicted to contain a signal peptide, of which 156 (5.6% of the translations) lacked a transmembrane region. Based on these predictions, we identified 4.3% of the unigenes as coding for secreted proteins. Remarkably, the translation prediction showed that 447 orphan sequences were predicted to be coding (39.6% of the orphans). These interesting unigenes are in all probability novel protein coding genes, without any known homologue in the database to date. On the other hand, and equally remarkable, a large fraction of the unigene sequences (at most 682 sequences or 18.6%) is predicted not to be protein-coding, pointing to the existence of EST sequences derived from non-coding RNA.
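The GC1%, GC2% and GC3% values, and the reason why a high GC3% keeps the reverse strand free of stop codons, can be illustrated with a short Python sketch. The function names and the 12-nt toy sequence are made up for this example; the sketch assumes an in-frame coding sequence and is not the procedure used to produce the values above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def gc_by_codon_position(cds):
    """Return (GC1%, GC2%, GC3%) for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    percentages = []
    for pos in range(3):
        bases = cds[pos:usable:3]             # every third base from this position
        gc = sum(base in "GC" for base in bases)
        percentages.append(100.0 * gc / len(bases) if bases else 0.0)
    return tuple(percentages)

def stops_on_reverse_strand(cds):
    """Count stop codons when reading the reverse complement in frame."""
    rev = cds.upper().translate(_COMPLEMENT)[::-1]
    return sum(rev[i:i + 3] in STOP_CODONS for i in range(0, len(rev) - 2, 3))

# Toy ORF in which every codon ends in G or C (GC3 = 100%)
toy_orf = "ATGGCCAAGTGC"
gc1, gc2, gc3 = gc_by_codon_position(toy_orf)
reverse_stops = stops_on_reverse_strand(toy_orf)
```

In the toy ORF every codon ends in G or C, so every codon of the reverse complement starts with C or G and none of them can be a stop codon. This is why the longest ORF is a poor guide to the coding strand in GC3-rich sequences.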

Trans-spliced leaders

One of the major eccentricities in the molecular biology of nematodes is the widespread occurrence of operons and trans-spliced leader sequences, frequently found at the start of transcripts (Guiliano and Blaxter 2006). To investigate the occurrence of the presently known trans-spliced leader sequences in our unigenes, searches were performed based on common features of a reference set of 70 spliced leader (SL) sequences reported in Guiliano and Blaxter (2006). Surprisingly, in the first approach using BLASTn-searches, none of the R. similis unigenes in our dataset matched any SL sequence, whereas in a control dataset of Meloidogyne incognita ESTs (in which trans-splicing is known to occur), 293 sequences matched a total of nine different SL sequences of our reference set. After this negative result, a second approach was applied using a pattern, which was able to match 76% of the reference SL sequences. Likewise, this pattern was unable to match any R. similis unigene. In contrast, it found motifs in 12.2% of the control data set of M. incognita ESTs (376 out of 3,098 ESTs). To exclude the possibility that by chance none of our unigenes belongs to a nematode gene family which is trans-spliced, we searched for very strong homologues of the trans-spliced M. incognita ESTs (detected by the pattern search) in the R. similis unigenes and in C. elegans ESTs. Five R. similis unigenes with complete 5′ UTRs (CL13, CL18, CL25, CL112 and CL929) were found to have very strong homology to trans-spliced M. incognita ESTs (ranging from 60 to 93% identity on the protein level). For each of the five cases, the unigene and the trans-spliced M. incognita EST were aligned (on the DNA level), together with the strongest C. elegans homologue. In all five cases, this alignment revealed the occurrence of trans-spliced leader sequences on both the M. incognita EST and the C. elegans gene, but absence of such sequences in the 5′ UTR region of the R. similis unigene (see additional material figure A3). Closer inspection of the 5′ UTR sequences of the five R. similis unigenes did not reveal any sequence similarity, neither with the M. incognita or C. elegans homologues, nor with each other.

Occurrence of mitochondrial ESTs

Depending on the cDNA library construction method, a remarkably high fraction of ESTs can be of mitochondrial origin and these EST can even be used as a guideline for the sequencing of the mitochondrial genome (Gissi and Pesole 2003). We searched our unigene dataset for ESTs most likely derived from the mitochondrial genome, and found five unigenes with significant similarity to various nematode mitochondrial genes (see Table 4). The G + C content of those sequences was 16.4%, very low compared to the mean G + C content of the total unigene set (48.8%; see Fig. 1). This low G + C content is a major characteristic of mitochondrial genomes, in most cases between 20 and 30% (He et al. 2005). The different G + C content of the nuclear unigenes compared to the mitochondrial unigenes could therefore be used to predict the mitochondrial origin of the EST. The G + C content density line of the total unigene dataset follows a normal distribution, pointing to a similar source of the unigenes (see additional material figure A4). However, a bias from the normal distribution is observed at one end of the curve, caused by 21 unigenes with a lower G + C content (from 10 to 26%), most likely all originating from the mitochondrial genome (summarized in Table 4). To refute the possibility of dealing with numts [nuclear insertions of mitochondrial sequences into the nuclear genome (Richly and Leister 2004)], and as part of an ongoing project to sequence the mitochondrial genome of R. similis, primers were designed on two putative mitochondrial unigenes. One primer was based on the nad5 homologue (CL86) and another on coxI (CL429). Via long distance PCR we succeeded in amplifying a piece of 3,396 nt of the mitochondrial genome of R. similis (data not shown). The average G + C content of the amplicon was 14.40% and contained the complete sequence of another mitochondrial unigene (CL21, a coxII homologue). 
The results of the complete sequencing will be reported elsewhere. Interestingly, these results will probably shed light on the origin of the largest cluster in our dataset (cluster 1). This cluster represents a disproportionately large number (n = 575, or 8%) of the total number of ESTs. Despite this, it shows no significant homology to any known sequence using various BLAST approaches against various databases, and is also predicted to be non-coding by different coding sequence prediction programs. Using cluster 1 specific primers, we succeeded in amplifying this unigene from a cDNA pool constructed from DNase-treated RNA, making genomic DNA contamination very unlikely (data not shown). In addition, genomic contamination (such as numts) is expected to appear as singletons and not as a cluster. The only feature of this sequence that gives a clue about its origin is its low G + C content (16.3%, see Table 4). Based on the available mitochondrial genomic data, it is possible that cluster 1 is derived from the mitochondrial genome. Furthermore, in clustered EST datasets of other plant-parasitic nematodes (Bursaphelenchus and Pratylenchus), the largest clusters also represent a disproportionately large part of the EST dataset (Kikuchi et al. 2007; Mitreva et al. 2004). Notably, in these cases homology was found to mitochondrial genes, but certainty for cluster 1 of R. similis will only be achieved when the complete sequence of the R. similis mitochondrial genome is known.
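The use of G + C content as an indicator of a possible mitochondrial origin can be illustrated with a short sketch. This is not the pipeline used in this study; the function names, the toy sequences and the 26% cut-off (taken here from the upper bound of the low G + C group described above) are assumptions made for illustration only.

```python
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Return the G + C percentage of a nucleotide sequence.

    Only unambiguous bases (A, C, G, T) are counted, so N's do not
    dilute the percentage.
    """
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def flag_low_gc(unigenes: Dict[str, str], threshold: float = 26.0) -> List[str]:
    """Return the names of unigenes whose G + C content is <= threshold.

    The default threshold is an assumed cut-off, not a value validated
    for any particular dataset.
    """
    return sorted(
        name for name, seq in unigenes.items() if gc_content(seq) <= threshold
    )


if __name__ == "__main__":
    # Hypothetical toy sequences, not real R. similis unigenes.
    toy = {
        "low_gc_example": "ATATATATGC",
        "high_gc_example": "GGCCGGCCAT",
    }
    for name, seq in toy.items():
        print(name, round(gc_content(seq), 1))
    print("candidates:", flag_low_gc(toy))
```

A cut-off of this kind only yields candidates; as described above, confirmation still requires similarity searches or amplification from the mitochondrial genome.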

Table 4 Unigenes with the lowest G + C content

Unigenes with similarity to Wolbachia sequences

A subset of 43 unigenes (2%) had homology exclusively to prokaryotic sequences. Although the possibility exists that some of these sequences are the result of contamination, 18 significant matches to genes of the endosymbiotic Wolbachia species were found. Further investigation of all the unigenes (including those with homologues not limited to prokaryotic species) revealed another 12 unigenes with BLASTx top-hits to Wolbachia (see Table 5). The mean G + C content for these 30 sequences is 36.9% (±4.2%), similar to previously reported G + C percentages of Wolbachia sequences (Foster et al. 2005). The high similarity to known Wolbachia genes indicates that the corresponding unigenes are genuine Wolbachia-derived transcripts. Consequently, the discovery of these sequences suggests an endosymbiotic presence of Wolbachia within R. similis.

Table 5 Unigenes with BLASTx top-hits to Wolbachia sequences

Unigenes putatively involved in parasitism

Some unigenes are putatively involved in the parasitism process, based on homology to genes of other parasitic species (see Table 2). Two unigenes coding for plant cell wall degrading enzymes were found: an endoglucanase (which was cloned and characterized by Haegeman et al. (2008)) and a xylanase. Both enzymes soften the plant tissue to facilitate the intracellular migration of the nematode. Such cell wall degrading enzymes have been identified in numerous plant-parasitic nematodes (Ledger et al. 2006; Smant et al. 1998) and have been extensively studied in the light of parasitism. Other unigenes putatively encode enzymes that can neutralize reactive oxygen species (ROS), produced by the host as a defense mechanism in response to infection by the nematode (Dubreuil et al. 2007; Jones et al. 2004; Robertson et al. 2000). Further, three unigenes show homology to fatty acid- and retinoid-binding proteins of parasitic nematodes. Fatty acids are compounds that play a role in the host defense-signaling pathway (Kennedy et al. 1995; Prior et al. 2001). As a consequence, nematode proteins that bind such compounds could modulate the host defense to facilitate parasitism. One unigene of this group resembles a gene of the animal-parasitic nematode Brugia malayi (Kennedy et al. 1995). The other two unigenes (CL12contig1 and CL70contig1) show highest homology to SEC-2 proteins (also called FAR, fatty-acid and retinol-binding) of the PPN Globodera pallida. However, whereas CL12contig1 also shows homology to a (hypothetical) protein of the free-living nematode C. elegans, CL70contig1 has homology only to SEC-2 proteins of parasitic nematodes. This could point to the existence of functionally distinct SEC-2 proteins, one with a general function and another with a function related to parasitism, as hypothesized previously (Garofalo et al. 2003; Prior et al. 2001). Two unigenes have homology to SXP/RAL-2 genes, whose gene products most likely play a role in host localization (Prior et al. 2001; Tytgat et al. 2005). In addition to those functionally known genes, some unigenes showed homology to putative parasitism genes without known function. To identify additional unknown parasitism gene candidates, we searched the 447 orphan sequences predicted to be protein coding for homology exclusively to ESTs of plant-parasitic nematodes (PPN). A total of 212 such unigenes (5.8%) were retrieved, of which 18 (1.3%) were predicted to encode secreted proteins. Five of them (CL26, CL546, CL793, 92h12, 23g06) were designated as good candidate parasitism genes, since the homologous EST sequences originated exclusively from the parasitic stages of PPN (i.e. second stage juvenile to adults). Future experiments could confirm their putative role in the parasitism process.

Discussion

With the generation of thousands of new EST sequences from mixed stages of the plant-parasitic nematode R. similis, interesting research topics are introduced. Based on our analysis, the R. similis ESTs are derived from three different sources. The majority of the ESTs are derived from the nuclear genome. A small fraction (∼0.6%) most likely has a mitochondrial origin, corresponding to sequences with a very low G + C content (∼16% G + C). Finally, a third subset (∼1%) seems to be derived from a Wolbachia species. To our knowledge, this obligate intracellular endosymbiont has only been reported in arthropod species and a few filarial nematode species (Hise et al. 2004; Kramer et al. 2003; Taylor et al. 1999). In these nematode species, Wolbachia seems to be required for successful molting as well as for reproduction of the nematode. In only three genera of plant-parasitic nematodes (Heterodera, Globodera and Xiphinema) have bacteria-like endosymbionts, other than Wolbachia, been found (Noel and Atibalentja 2006; Vandekerckhove et al. 2002). On the other hand, we cannot exclude the possibility of an insertion of Wolbachia genes into the genome of R. similis, as it has recently been shown that such inserts can be transcriptionally active (Hotopp et al. 2007).

The majority of the unigenes are derived from the nuclear genome of R. similis. Approximately one-third of the unigenes code for proteins involved in general metabolic pathways. Other classifications based on BLASTx results can be found in Fig. 2, but may change slightly in the future as more sequence data become available. Besides the unigenes with clear homology, a relatively large part of our unigene dataset (30.8%) lacked homology to any sequence in the databases to date (called “orphan” sequences). Multiple explanations can be found for these orphans: (1) the most “preferred” one is that the unigene represents a genuine novel protein-coding gene (estimated to be 12.1% of the unigenes). However, (2) the length of the unigene also plays a role, as a correlation exists between the length of a unigene and its homology significance level. Thus significant homology simply cannot be detected if the sequence is too short. Alternatively, (3) unigenes consisting mainly of untranslated region (UTR) will most likely lack homology, as UTRs are the most diverse regions of transcripts (McCarter et al. 2003). Moreover, (4) for unigenes derived from (non-coding) contaminating DNA, most likely no significant homology will be detected. Finally, it is possible (5) that some of the orphan unigenes correspond to regulatory non-coding RNAs rather than mRNA, since evidence is accumulating for the ubiquitous role of these non-coding RNAs in translational regulation. Recent estimates in humans state that at least 20% of the genes are regulated by over 1,000 miRNAs (Bentwich et al. 2005; Lim et al. 2005) and in the model nematode C. elegans, 112 miRNA genes have been identified so far (Ruby et al. 2006). One of the most intriguing unigenes lacking homology to any known sequence is the largest cluster in our dataset, representing about 8% of the ESTs.
In fact, it is frequently reported in the literature that the largest clusters in EST analyses contain a disproportionately large number of ESTs [e.g. the largest cluster contained 10% of the ESTs in Mitreva et al. (2004), and 4.7% in Ranganathan et al. (2007)]. Often these oversized clusters do not show homology to any known sequence (Dubreuil et al. 2007; Ranganathan et al. 2007). Unfortunately, few attempts have been undertaken to clarify this. Based on preliminary sequence data of the mitochondrial genome of R. similis, we suggest that cluster 1 most likely has a mitochondrial origin. If so, it should be part of a transcriptionally active region with a high expression level.

The translation of the unigenes revealed a very high GC3 percentage of 63.4%, while the overall G + C content of the unigenes was approximately 54%. Analysis of genome sequences led to the thermodynamic stability hypothesis to explain observed differences in G + C content. It states that G + C content is correlated with the optimal growth temperature of the organism (in the case of bacteria) or the optimal body temperature (in the case of vertebrates) (Jabbari and Bernardi 2004). However, this hypothesis could not be confirmed by several other studies, pointing to other, as yet unknown and more complex causes for the differences in G + C content between organisms (Basak and Ghosh 2005; Belle et al. 2002; Wang et al. 2006b). Although at first sight the thermodynamic stability hypothesis could apply to the G + C content of R. similis unigenes, other tropical nematodes (such as Meloidogyne species) have a clearly lower G + C content compared to nematodes occurring in temperate climates such as Heterodera and Caenorhabditis (Mitreva et al. 2006; Table 6).
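For readers unfamiliar with the GC3 measure, the following minimal sketch shows how it can be computed from an in-frame coding sequence: it is the G + C percentage restricted to the third position of each codon. The function name and the toy sequences are illustrative assumptions and do not represent the software used in this study.

```python
def gc3(cds: str) -> float:
    """Return the G + C percentage at third codon positions.

    Assumes `cds` is already in frame (first base = codon position 1).
    A trailing incomplete codon contributes no third base; ambiguous
    bases at the third position are skipped.
    """
    cds = cds.upper()
    third_positions = cds[2::3]  # bases at indices 2, 5, 8, ...
    usable = [base for base in third_positions if base in "ACGT"]
    if not usable:
        return 0.0
    gc_count = sum(1 for base in usable if base in "GC")
    return 100.0 * gc_count / len(usable)


if __name__ == "__main__":
    # Hypothetical toy coding sequences, not real unigenes.
    print(gc3("ATGGCC"))  # third positions: G, C
    print(gc3("ATGGCA"))  # third positions: G, A
```

Because the third codon position is largely free to vary without changing the encoded amino acid, GC3 tends to reflect compositional bias more strongly than the overall G + C content.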

Table 6 Comparison between four different nematode species

Another remarkable result from this analysis is that, despite the use of various approaches, we could not extract sequence fragments from the unigene dataset that resemble trans-spliced leader sequences, which are known to occur in other nematode species such as M. incognita and C. elegans. This could indicate that R. similis makes no use of trans-splicing. However, this would conflict with the view that trans-splicing occurs widely throughout the phylum Nematoda. It is therefore also possible that R. similis makes use of a different set of spliced leader sequences, although our attempts to detect such sequences were unsuccessful. The lack of known trans-spliced leader sequences could also explain the difficulties encountered when constructing an oligo(dT)-SL1 PCR-based cDNA library of R. similis. This library turned out to have a rather low number of primary transformants, which may be due to an unsuccessful amplification step. Remarkably, a preliminary search for known trans-spliced leaders in the EST data from Heterodera glycines also gave a negative result (data not shown), arguing for a thorough investigation to validate the systematic occurrence of trans-spliced leader sequences (known and unknown) throughout the phylum Nematoda. Some comparisons between R. similis and three other nematode species can be found in Table 6.
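One simple way to screen ESTs for a known spliced leader is sketched below. This is an assumed approach for illustration, not a description of the methods used in this study. Because ESTs are often truncated at the 5′ end, the sketch searches only for the 3′ portion of the leader near the start of each sequence and tolerates a small number of mismatches. The SL1 sequence given is the commonly cited C. elegans SL1 and should be verified against a primary source before use; the parameter values are arbitrary.

```python
from typing import Optional, Tuple

# Commonly cited C. elegans SL1 sequence (assumption: verify before use).
SL1 = "GGTTTAATTACCCAAGTTTGAG"


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_leader(
    seq: str,
    leader: str = SL1,
    min_len: int = 12,
    window: int = 40,
    max_mismatch: int = 1,
) -> Optional[Tuple[int, int]]:
    """Search the first `window` bases of `seq` for the 3' end of `leader`.

    Returns (start position, mismatches) of the best hit with at most
    `max_mismatch` mismatches, or None if no such hit exists.
    """
    seq = seq.upper()
    probe = leader[-min_len:]  # last `min_len` bases of the leader
    best = None
    last_start = min(window, len(seq)) - len(probe)
    for start in range(0, last_start + 1):
        mismatches = hamming(seq[start:start + len(probe)], probe)
        if mismatches <= max_mismatch and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


if __name__ == "__main__":
    # Hypothetical toy ESTs: one carrying the 3' end of SL1, one without.
    print(find_leader("CCCAAGTTTGAGATGAAA"))
    print(find_leader("A" * 30))
```

A negative result from such a screen does not by itself exclude trans-splicing, since a species may use divergent leader sequences that a search seeded with known leaders cannot detect.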

Because of the importance of R. similis as a major pest, the EST data can deliver information on the parasitism process as well as on potential control strategies based on disrupting gene expression. Unigenes involved in the plant-parasitic lifestyle of R. similis were found through homology with genes of parasitic species, with known and unknown function. Searching for unigenes coding for secreted proteins has proven a useful approach to identify parasitism genes used by the nematode. It is assumed that the majority of the parasitism proteins are secreted into the host to modulate the nematode’s environment. Most of them originate from the pharyngeal glands and are injected into the plant tissue through the stylet, a hollow needle-like structure in the head of the nematode. A better understanding of the parasitism process will come from identification and characterization of the parasitism genes. In this respect, it is remarkable that the majority of the R. similis unigenes with homology exclusively to PPN EST sequences lack a BLASTx hit, indicative of a high potential for discovery of novel genes in PPN EST sequences. Besides the elucidation of the parasitism process as a stepping stone to parasite control, GO and RNAi-phenotype data analyses suggest that suitable targets for controlling R. similis may also be found among genes involved in the regulation of developmental processes. Experiments with RNAi show that the viability of nematodes can be severely affected when essential nematode genes are silenced (Kamath et al. 2003). A promising technique in this respect is the in planta generation of nematode-specific double-stranded RNA, leading to a decreased viability of the nematode when it ingests the interfering RNA molecules (Bakhetia et al. 2005; Gheysen and Vanholme 2007). Therefore, the sequences delivered by this EST project can aid in various ways in establishing efficient parasite control.