Introduction

HTS has opened considerable opportunities in the field of genomics for the study of a number of organisms (Martin and Wang 2011; Stapley et al. 2010; Werner 2010). Through massive sequencing approaches, differential expression studies, candidate gene searches, gene ontology annotation, molecular marker development, alternative splicing, genetic variation characterization and SNP development can now be done in non-model species (Ekblom and Galindo 2011). Such information could be of interest for insect vectors in order to improve control strategies. However, comparative studies first require a reference assembly that can either be a genome or a transcriptome.

Concerning the bloodsucking bugs transmitting to human the parasite Trypanosoma cruzi, responsible of the Chagas disease, endemic to Latin America, only one genome that of Rhodnius prolixus, was recently sequenced and assembled (https://www.vectorbase.org/). R. prolixus was the main vector of Chagas disease in Northern Latin America and Central America before its recent eradication (Hashimoto and Schofield 2012). But, in Brazil, since the recently eradication of the main vector, Triatoma infestans, by massive campaigns of chemical control (Dias et al. 2002), health authorities are now concerned about Triatoma brasiliensis, another vector increasingly reported in houses and peridomestic environment (Almeida et al. 2008; Borges et al. 2005; Costa 1999; Costa et al. 1998, 2003). Comparative transcriptomic study based on the use of the newly available high throughput sequencing (HTS) technologies would allow deciphering the molecular bases of the domiciliation process.

Generating such a reference from a non-model species can be complex since assembling de novo the considerable amount of reads generated from HTS is a difficult puzzle. Assembling transcriptomic rather than genomic data may appear easier, facilitating further differential expression studies, as non-coding sequences are absent and the amount of information reduced (Paszkiewicz and Studholme 2010). Nevertheless, alternative splicing, gene duplication or intrinsic polymorphism of biological sample may complicate transcriptome assembly (Sammeth 2009). HTS has different characteristics depending on the technology, which may affect the assembly process. Among the conventional techniques, the substantial length (approximately 342 bp) of 454 reads facilitates contig assembly, but this technique also generates a high frequency of homopolymer repetition errors. Furthermore, read coverage of transcripts remains low with this technique because of the small amount of reads produced: typically around hundreds of thousands of reads (Ekblom and Galindo 2011). As an alternative, Illumina technology sequencing, which may be done on single or paired-end reads, produces a very large amount of reads (typically a billion), generating a high coverage depth. However, the small size of the sequences (about 100 bp) complicates the assembly. Both technologies can be exploited for their respective advantages by combining long reads from 454 with high coverage depth from Illumina data (Martin and Wang 2011).

The choice of assembly software can be complex. Indeed, with the development of HTS, the publication of various de novo transcriptome assemblers, based on more or less similar approaches, has boomed. For example, many software programs are available for long reads typically obtained with 454 technology, including MIRA (Chevreux et al. 2004), CAP3 (Huang and Madan 1999), Newbler (Knudsen et al. 2007), NGen (http://www.dnastar.com/t-nextgen-seqman-ngen.aspx), or CLCbio (http://www.clcbio.com). Among these, three programs are highly recommended: MIRA, which gives very reliable output according to Mundry et al. (2012) and Ren et al. (2012); CAP3 that was suggested as a robust assembling software by Cahais et al. (2012); and the commercial program Newbler, often furnished by the sequencing platform used and recommended by Ren et al. (2012). MIRA is the only program that combines the iterative approach and the all-in-one-step approach. The first step of iterative approach produces an assembly with very stringent parameters (particularly with regard to the quality criteria). Following steps continue with less stringent criteria until the assembly is finished. The all-in-one-step approach starts with all data whatever its quality and generates a consensus using best quality bases as reference. This way, MIRA uses iterative multipass strategies for reads targeted as high or acceptable quality for the assembly. At this step, it detects SNP during the assembly and sorts them by type to reduce assembly mistakes. In a second stage, a fallback strategy uses reads with lower quality if necessary (Chevreux et al. 2004). CAP3 is based on a multiple alignment strategy, taking into account the base quality (Huang and Madan 1999). Newbler starts by identifying overlaps between reads and next builds multiple alignments from overlapping reads to generate preliminary contigs. The final step consists in creating a contig graph to connect contigs together from the edges (Knudsen et al. 2007). To assemble short reads typically produced by Illumina technology, various free software are available, among which some are specifically devoted to transcriptomic data: Trans-Abyss (Robertson et al. 2010), Trinity (Grabherr et al. 2011), Oases (Schulz et al. 2012), Rnnotator (Martin et al. 2010), Multiple-K (Surget-Groba and Montoya-Burgos 2010) or SOAPdenovo-Trans (Xie et al. 2013). Trinity, which was especially designed for transcriptome assembly and offers several advantages: it can solve 99 % of the initial sequencing errors, detects isoforms from the alternative splicing or transcripts recently duplicated and identifies allelic variants while building complete transcripts (Grabherr et al. 2011).

To determine the most appropriate assemblers, various comparative studies have been conducted on biological datasets or on simulated coding DNA sequences (Cahais et al. 2012; Mundry et al. 2012; Ren et al. 2012; Vijay et al. 2013; Zhao et al. 2011). Comparisons were performed based on three groups of criteria. The “computer” criteria include the amount of RAM required and the run time. The quantitative criteria comprise the number of contigs, the total number of bases assembled, the average length of contigs, the N50 (length for which the collection of all contigs of that length or longer contains at least half of the total of the lengths of the contigs), the number of contigs higher than 1 kb and the number of mapped reads on the final assembly. The qualitative criteria include the completeness of the transcriptome and the presence of chimeric contigs, both parameters being obtained by comparison with a reference genome. The programs considered as the most effective differ between comparative studies depending on datasets and on the weight attributed to different evaluation criteria. Based on a simulated dataset, Vijay et al. (2013) recommended SOAPdenovo, which provides non chimeric, but fragmented, transcriptome, especially when there are a large number of reads and high polymorphism in the dataset (Vijay et al. 2013; Zhao et al. 2011). Among assemblers based on the single-k-mer strategy, Zhao et al. (2011) identified Trinity as the best assembler for Illumina datasets because of the greater number of contigs in the assemblies that can be aligned to the reference genome. For 454 reads from transcriptomes, Ren et al. (2012) recommended MIRA, while Mundry et al. (2012) recommended Newbler and MIRA both showing contrasting specificities. MIRA generated a fragmented and redundant assembly while Newbler built a more complete restoration of full-length transcripts, although it contained large chimeric contigs. Cahais et al. (2012) developed a more complex workflow including data from 454 and Illumina using Abyss and CAP3 assemblers.

All these comparison studies are based on data generated from well-known species. By contrast, without a reference genome, the number of genes and alternative splicing are unknown. Therefore, it is difficult to estimate the expected number of transcripts, the completeness, quality and validity of the assembly or to detect chimeric contigs (Cahais et al. 2012). Definition of assessment methods should help in the choice of assembling strategy. Furthermore, all studies, except Cahais et al. (2012) study, compared assemblies generated from a single sequencing source (Illumina or 454) while both sources have complementary properties that could be combined.

To date, no massive genomic and transcriptomic data are available for T. brasiliensis except the production of a salivary gland cDNA library (Santos et al. 2007). The first step of such an approach is then to produce a reliable reference for T. brasiliensis. In this study, we applied HTS to RNA extracted from antennae, rostrum and head of T. brasiliensis males and females collected in various environments: sylvatic, domiciliary or peri-domiciliary, to generate an optimized chemosensory transcriptome. The generated assembly is to be used as a reference for further comparative transcriptomics. It is known that chemosensation allows bugs to interact with their environment and influences their behavior, thus participating in insect adaptation to new environments and new host choice, playing an important role in anthropogenic adaptation. Studying the expression of the chemosensory transcriptomes from various populations of Triatominae would be an innovative way to address the issue of domiciliation.

For the purpose of getting the best assembly, several types of data and assemblers were tested and compared. Two sequencing technologies were used (454 and Illumina in single or paired-end) with samples from a single individual or pooled individuals. Ten assembly workflows were compared and characterized according to a set of qualitative and quantitative criteria. The hybrid assembly using Illumina paired-end and 454 corrected data in hierarchical workflow using several assemblers was found to optimize both the number of contigs, contig size and the completeness of transcriptome.

Materials and methods

Sampling

Samples of T. brasiliensis were collected in March 2011, in the Caatinga biogeographic zone of the municipality of Caico in the north of Brazil. Five populations were collected in sylvatic, domestic and peridomestic habitats (Table 1).

Table 1 T. brasiliensis samples from which transcriptomes were sequenced and characteristics of assemblies generated with Trinity

RNA extraction and sequencing

Several individuals were pooled by sex and population, except for the SFF sample which consisted of one sylvatic female (Table 1). To target genes involved in the chemosensory system, RNA was extracted from the antennae and rostrum of blood-sucking bugs using the TRIzol® Reagent kit (Invitrogen, Carlsbad, CA, USA). Aliquots of all extraction samples were pooled and the mix was used as template for the construction of a concatamerized, normalized cDNA library prepared and sequenced using 454 Roche GS FLX Titanium (½ Pico Titer Plate, Tbra454, Table 1) by LGC Genomics GmbH (Berlin, Germany). In addition, fourteen libraries were constructed using TruSeq RNA Kit 2010 from Illumina for each RNA sample independently and sequenced with Illumina HiSeq 2000 method (http://support.illumina.com/) in single-end reads of 100 bp by LGC Genomics plateform. One library (SFF sample) was prepared using ScriptSeq™ Complete Kit (Human/Mouse/Rat)—Low Input of Epicentre (http://www.epibio.com/) and sequenced by Imagif (CNRS, Gif-sur-Yvette, France) in paired-end reads (2 × 100 pb) with Illumina Hiseq 1000, using TruSeq PE Cluster Kit v3-cBot-HS and TruSeq SBS Kit v3-HS kits.

Preprocessing

lllumina single-end reads were cleaned with Cutadapt (Martin 2011) and Prinseq (Schmieder and Edwards 2011a). Two sets of thresholds were applied (the gentle “cleaned-1” and very stringent “cleaned-2”) resulting in the deletion of: (1) indefinite bases, (2) the poly A tails (using parameters trim-tail-left 5 for both cleanings; trim-tail-right 5 for cleaned-1 and trim-tail-right 8 for cleaned-2), (3) the small sequences (minimum size of 20 for cleaned-1, 40 for cleaned-2), (4) exact duplicated sequences resulting from PCR (for both cleanings), (5), the low quality ends of the sequences (minimum grade of 20 for cleaned-1, and 28 for cleaned-2), and (6) the low complexity sequences (both cleanings). The cleaned-1 protocol was also applied for 454 reads and Illumina paired-end reads. For samples sequenced in paired-end, reads that became singlets after cleaning were set aside before assembly.

The quality of the sequences was analyzed using FastQC (Supplementary data S1; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/). Part of possible contamination of cleaned-1 reads was evaluated for each single-read sample using Deconseq version 0.4.1 (Schmieder and Edwards 2011b). We used several databases: Human—Reference GRCh37, Human—Celera Genomics, Human—Craig Venter (HuRef), Human—Seong-Jin Kim (Korean), Human—Chromosome 7 version 2 (TCAG), Human—unique from Reference in James Watson, YanHuang (YH; Asian), Yoruba (NA18507; African), Mus musculus C57BL/6 J build 37, Danio rerio Tuebingen version Zv7, 18S sequences (57,317 unique sequences, 08/11/10), viral genomes in RefSeq 45 (3,761 unique sequences, 02/12/11), bacterial genomes (2,206 unique genomes, 02/12/11), archaeal genomes (155 unique genomes, 02/12/11), Salmonella enterica genomes (52 strains, 12/16/10) and bacterial genomes HMP (76,337 WGS sequences, 02/12/11) databases.

Assembly tools

Several assembly programs were used and compared (Fig. 1). 454 reads were assembled using three different assemblers: (1) Newbler, (2) MIRA with the skim_each_pass assembly option, keep_percent_memory_free at 100 and merge_xmltraceinfo (mxti) = no loadreads option (Chevreux et al. 2004), and (3) CAP3 with default settings (Huang and Madan 1999). Illumina reads were assembled using Trinity in all strategies with k-mer_method jellyfish options with 40 GB of system memory, on 4 threads (Marçais and Kingsford 2011; Grabherr et al. 2011). The longest run for the largest dataset took 2 weeks. Finally, MIRA and CAP3 were also used to assemble a merged dataset: raw 454 reads, assembled 454 contigs and Illumina contigs. Assemblies performed on raw data took into account quality information.

Fig. 1
figure 1

Workflow diagram of the different de novo assemblies of the Triatoma brasiliensis transcriptome. “C” notifies contigs, “S–C” means super-contigs and “C–S” contig-singlets. 1 454 reads assembled with Newbler. 2 454 reads assembled with MIRA. 3 454 reads assembled with CAP3. 4 454 contigs and singletons from assemblies 1 and 2 reassembled with CAP3. 5 454 contigs and singletons from assemblies 1, 2 and 3 reassembled with CAP3. 6 Illumina reads from a pool of all populations with cleaned-1 (6a) or cleaned-2 (6b) single-end and Illumina paired-end reads of the SFF sample are assembled with Trinity (contigs higher than 200 bp are conserved). 7 and 8 After processing for assembly 6 (a/b/c), 454 read data are assembled with Illumina contigs using MIRA (assembly 7-a/b/c) or CAP3 (assembly 8-a/b/c). Contigs higher than 200 bp are selected. 9 Illumina contigs from Trinity assembly 6 (a/b/c) are merged with 454 contigs and singletons from assembly 4 and reassembled with CAP3. Contigs higher than 200 bp and singletons higher than 700 bp are selected. A CD-HIT-EST is processed to delete potential duplicates. 10 Same protocol as assembly 9, but here 454 contigs and singletons from assembly 4 are corrected with Illumina reads before reassembly (see Materials and methods section)

Workflow

In a first step, 454 and Illumina data were processed separately (assemblies 1–6; Fig. 1). 454 reads were first assembled with Newbler (assembly 1), MIRA (assembly 2) and CAP3 (assembly 3). Contigs and singletons from assemblies 1 and 2 or from all three assemblies (1, 2 and 3) were pooled and reassembled with CAP3 (assemblies 4 and 5 respectively). Illumina reads were assembled using Trinity (assembly 6). In a second step, the Illumina data were merged with the 454 data to increase contig size (assemblies 7–10; Fig. 1). In the first trial, 454 raw reads were pooled with Illumina contigs from Trinity and used as input to MIRA (assembly 7) or CAP3 (assembly 8). In a second trial, built 454 contigs from assembly 4 were pooled with Illumina contigs from Trinity and reassembled with a second CAP3 (assemblies 9 and 10). This last step led to the building of “super-contigs” which were selected when their length exceeded 200 bp using the –min_len option of Prinseq. Furthermore, some contigs generated from the first steps of 454 and Illumina assemblies were not reassembled in the final CAP3 run and were called contig-singlets. Some of them showed significant length and were therefore conserved (with a threshold of 700 bp). In these two last assemblies (9 and 10), potential duplicates were eliminated with CD-HIT-EST (Li and Godzik 2006).

Illumina data compilation for assembly

Illumina data were compiled in various ways depending on biological samples used (single individual, single population or pooled populations), sequencing approach (single-end or paired-end) or cleaning procedure. Four main ways were used for subsequent assembly: (1) cleaned-1 Illumina single-end read sequences for each population (assemblies 6-a1–6-a14, Table 1); (2) cleaned-1 Illumina paired-end read sequences for a single individual (SFF) (assemblies 6c–10c, Tables 1 and 2); (3) cleaned-1 Illumina single-end reads from all samples pooled (assemblies 6a–10a, Tables 1 and 2) and (4) cleaned-2 llumina single-end reads from all samples pooled (assemblies 6b–10b, Tables 1 and 2). Single Trinity assembly of raw Illumina data was processed for these four types of data (assemblies 6a1–6a14, 6a/b/c). More complex workflows (assemblies 7–10) were only performed on data types (2) (3) and (4). To distinguish between the effect of paired-end sequencing and effect of using a single individual, assembly 6 was also conducted on data source (2) using only forward reads (assembly 6c’, Table 1).

Table 2 Characteristics of the different assemblies

Correction of 454

454 sequencing is known to generate homopolymer tracts. Consequently, the contigs from the 454 assemblies were corrected by aligning Illumina reads on the 454 contigs using BWA (with default settings; Li and Durbin 2009). Polymorphism between 454 contigs and the mapped Illumina reads was collected with the options ‘faids’ and ‘mpileud’ of SAMtools (using default settings; Li et al. 2009). Contigs without polymorphism with mapped Illumina reads or contigs without mapped reads were not changed. 454 contigs showing polymorphism by comparison with Illumina reads were filtered with the bcftools option and corrected: substitutions, duplications or deletions were modified following Illumina read sequences. This step was repeated five times. The correction model was applied in assembly 10.

Indices of assembly quality

Several criteria were used to evaluate the quality of the assemblies. First indices were contig lengths, including N50, total number of bases and contig number. The number of contigs was compared to the 17,155 transcripts estimated from the genome of R. prolixus. The completeness of the transcriptome was also estimated by the presence of sequences belonging to the set of ultra-conserved eukaryotic proteins, tested using the CEGMA pipeline (Core Mapping Approach Eukaryotic Genes; Parra et al. 2007). In addition, we created a script to check the presence of chimeric contigs by reciprocal Blast from the predicted proteome of R. prolixus. We selected contigs and proteins that are best aligned together in both directions and with a Blast e-value smaller than 10−10. Contigs aligned with two different proteins in two loci without significant overlap were considered as potentially chimeric (Fig. 2).

Fig. 2
figure 2

Research for probable chimeric contigs using reciprocal Best Blast Hit method. 1 Reciprocal Blast between transcriptome of T. brasiliensis and the reference proteome of R. prolixus. 2 Selection of pairs of reciprocal best hit: Contig x from T. brasiliensis and protein y from reference proteome formed a Best Blast Hit pair (BBH1) when protein y was the best hit for contig x in the blastx result and when contig x was the best hit for protein y in tblastn result. 3 Removal of proteins that were selected in step 2 (BBH1 proteins) from the reference proteome and new reciprocal blast between transcriptome of T. brasiliensis and partially deleted reference proteome. 4 Selection of new pairs of reciprocal best hit called BBH2. 5 Selection of contigs belonging to both BBH1 and BBH2 and for which alignments of reference proteins on the contig sequence were not overlapping (<50 bp overlap)

Global coverage was measured for all assemblies. The RMBT (reads mapped back to transcripts) were evaluated using BWA (Li and Durbin 2009) and SAMtools (Li et al. 2009).

General annotation

We processed and compared the functional annotation of assemblies 10a and 10c with Blast2GO (Conesa 2005). The blast annotation part was processed in command line using Swiss-Prot database with an E-value threshold of 10−3 and an xml output format.

Chimeric contig validation

In order to resolve the chimeric contig problems of assembly 10-c, we studied the coverage of these contigs looking for atypical coverage profiles that could provide a clue about the misassembly location. To be sure that the atypical coverage profiles observed are specific to chimeric contigs, we compared them to the coverage of highly reliable contigs. Highly reliable contigs must have a unique alignment (with a single gene predicted from R. prolixus genome) of at least 700 bp over 80 % of their total length and an identity rate greater than 60 %. Illumina reads used for assembly 10-c were aligned with both contig categories (chimeric and reliable) with BWA. The coverage of each base was calculated using Bedtools (Quinlan and Hall 2010) and the variation along the contig was drawn with R (R Development Core Team 2008).

SNP

To compare polymorphism level of the samples used in assemblies, we checked for SNP. For this purpose, cleaned-1 reads from (1) Illumina single-end from all samples pooled and (2) Illumina single-end from a single individual (SFF) were mapped to the assembly 10c with BWA. SNP were detected from the 2 bam files generated using pileup function of SAMtools with −C 50 option to downgrade mapping quality for reads containing excessive mismatches. Quality filter of 20 was then applied on the SNP detected with VCFtools (Danecek et al. 2011).

Results

Contig number, total length and RMBT

The 454 library yielded 555,854 reads (Table 1). The fifteen Illumina libraries yielded nearly 360 million single reads and 46 million paired-end reads for nearly 41 Gb of sequences. Analysis with Deconseq revealed that probable contaminant reads varied between 0.78 and 6.85 % (Table 1). Contaminant reads mainly originated from microbes that could be biologically associated with bugs. As these microbes could also be involved in the adaptation process, we maintained such reads in dataset used for assemblies.

The number of contigs varied according to assemblies (Table 2). Assemblies built only with Illumina data (assemblies 6-a, 6-b and 6-c) comprised hundred thousand contigs (256,799; 136,755 and 365,717 contigs, respectively) whereas assemblies built only with 454 reads (assemblies 1–5) comprised 2–10 times less contigs (between 15,825 and 69,917 contigs). Among the 454 assemblies, Newbler assembly (assembly 1) had the lowest number of contigs (15,825) compared with the other 454 assemblies (58,271–69,917 contigs). Combining raw 454 data and assembled Illumina contigs (assemblies 7-a/b/c and 8-a/b/c) led to an intermediate number of contigs (between 56,851 and 92,040). Finally, assemblies merging Illumina and 454 contig data (9-a/b/c and 10-a/b/c) were composed of only tens of thousands of contigs (24,901–57,657). Preprocessing of Illumina reads also influenced the final number of contigs: the more stringent process (cleaned-2) generated less contigs (b assemblies) than the less stringent process (cleaned-1, a assemblies), whatever the assembly protocol. The total number of bases was similar between assemblies except for those obtained by Newbler (assembly 1) and Trinity (assembly 6), which were the two extremes, with 9.8 million and 179 million bases, respectively.

Among the Trinity assemblies generated from a single population data (assemblies 6-a1–6-a14; Table 1), assembly 6-a11 (SFC1) presented much higher number of contigs (66,830) and total number of bases (about 23 million bases) than other population samples. Inversely, assembly 6-a6 (PMD) comprised a low number of contigs (3544) and had a small total length (about 1 million of bases) probably due to the small amount of RNA extracted (Table 1). Technical replicates showed contrasted results (Table 1), the highest discrepancy being observed between SFC1 and SFC2 assemblies (66,830 contigs and 22,901,564 bp; 10,682 contigs and 3,041,341 bp, respectively). Generally, the highest number of post-cleaned reads led to the longest assembly and may explain this discrepancy although the reverse was observed for DMB1 and DMB2. Furthermore, low sequencing depth for single population samples may explain the lack of convergence between replicates. The transcriptome from the pool of Illumina single reads for all populations presents a high number of contigs with very short lengths. A stricter cleaning of the sequences did not necessarily improve the result (assembly 6-a vs. 6-b, Table 2). Overall, the total number of bases was directly related to the total number of cleaned reads assembled. By contrast, the Illumina paired sample SFF assembled with Trinity presented more contigs than the assembly generated from pooled population samples (365,717 compared to 256,799; Table 1), despite a smaller number of reads (15,762,629 R1 + 15,762,629 R2 vs. 74,386,931). This also held true when SFF was assembled as single-reads. Improvement of assembly length observed from SFF data may result from (1) the lower heterozygosity of the sample as a single individual was sequenced, (2) the type of tissue sequenced, as brain was added to antennae and rostrum for RNA extraction and also (3) from paired-end technology.

The RMBT is indicative of the part of the raw data that was used to build the assembly and therefore partly follows the total number of bases. For example, assembly 1 which showed the smallest number of bases assembled also showed the smallest RMBT (18.09 %; Table 2). On the other hand, assembly 6c’ had the highest RMBT (71.09 %) but Assembly 6c with the highest total number of bases showed a moderate RBMT (53.65 %). This resulted from the highest constraint in mapping paired-end read as mapping of associated reads had to be concordant. Except for these 3 assemblies, RMBT roughly ranged between 30 and 40 % including assembly 10c (39.6 %), which showed the best RBMT among assemblies combining 454 and Illumina data (assemblies 7–10), despite the constraint mentioned above in mapping paired-end reads.

N50 and length distribution

Length distribution of contigs from assemblies 1, 2, 4, 6-a/c, 8-a/c, 10-a/c is shown in Fig. 3 and metrics for all assemblies are given in Tables 1 and 2. N50 and length distribution differed between data sources (454; Illumina single or Illumina paired) and assembly processes. Assemblies 2–5 of the 454 reads presented similar distributions of contig length (Fig. 3) with a N50 around 500 bp (Table 2). N50 was higher for assembly 1 (662 bp) than other 454-based assemblies. In assemblies of Illumina single-end reads 6-a and 6-b, the most frequent class was between 200 and 300 bp, in agreement with N50 results (313 bp; Table 2). Assemblies of Illumina single-end reads from each population (assemblies 6-a1–6-a14; Table 1) or from the pool of all populations (6-a and 6-b, Table 2) presented similar N50 (about 300 bp). Assemblies performed with pooled Illumina contigs from single-end reads and 454 reads (7-a/b and 8-a/b) presented N50 and length distribution similar to 454 read assemblies alone (1–5). The protocol merging Illumina contigs from single-end reads and 454 contigs (9-a/b and 10-a/b) produced longer contigs (N50 from 917 to 919). Contig length distribution of these last assemblies appeared bimodal (Fig. 3) due to the addition of contig-singlets longer than 700 bp (Fig. 1). When similar workflows are compared, N50 were higher for assemblies generated from one individual (SFF sample) than for those from a pool of populations (6-c to 10-c vs. 6-a to 10-a and 6-b–10-b Table 2; 6-c and 6-c′ vs. 6-a and 6-b, Table 1).

Fig. 3
figure 3

Length distribution of contigs from a subset of assemblies. N50 is indicated by the dotted line

Annotation and completeness

A sequence identified as coding for one of the 248 proteins of the set of ultra-conserved eukaryotic proteins and with alignment length higher than a pre-computed minimum alignment score was named “partial”. If the alignment length was higher than 70 % of the protein length it is called “complete”. The set of “partial” matches necessarily includes the set of “complete” matches. Sample SFC1 differed from other population samples in the CEGMA result (assembly 6-a11; Table 1). It presented 56 % of the protein set whereas in other samples only 1.21–24 % of the protein set were counted. The completeness was mainly related to the total number of bases in the assembly and therefore highly variable from one sample to the other even between replicates. Sample SFF from a single individual presented a CEGMA that was clearly higher (89.52 %) than other samples from the pool of individuals from the same population. The best result was found for assembly 6-c: 93.95 % of the set of eukaryotic conserved proteins were found and 81.45 % of sequences were complete (Table 2). This improvement may result from the better quality of the assembly (low heterozygosity and paired-end data) but also from the biological representativeness of the sample: in the SFF sample, the whole head was used for RNA extraction instead of only the antennae and rostrum. The other assemblies presented from 37.1 to 89.52 % of the CEGMA set and 25–79.44 % sequences were complete (Table 2). Except for assembly 8-b, Illumina single-end reads gave better results when cleaned with the less stringent protocol cleaned-1 than with cleaned-2 (Table 2). The correction of 454 contigs with Illumina reads may increase the partial CEGMA result up to 6 % (assembly 10-a vs. 9-a, Table 2).

Number of potentially chimeric contigs

Whatever the reads (454 or Illumina single or paired), the proportion of potentially chimeric contigs remained below 0.2 % (assemblies 1–6). By contrast, the assemblers influenced this proportion. The Newbler assembler led to a higher proportion of potentially chimeric contigs than the other 454 assemblers, CAP3 or MIRA (0.19 % compared to an average of 0.05 %). More significantly, the complex assemblies increased the proportion of potentially chimeric contigs up to 1 % (7-c; 8-b; 9-a, b, c and 10-a, b, c).

Chimeric contigs coverage

All contigs analyzed showed heterogeneous coverage (Fig. 4). Among the potentially chimeric contigs, we found three types of graphs. An example of the first type is shown with Contig 9883, where a marked drop in the average coverage is observed (around the 800th base). This region is located between the two alignments with two different R. prolixus putative genes. Comp1346_c0_seq 1 is an example of the second type of graph: two areas are well differentiated in their coverage. The first of these areas is from the 1st to 1,500th base pair and has a relatively low coverage (up to 100 reads/base) compared to the second, which extends from the 1,500 to the 2,500th base pair and has a higher coverage of up to 600 reads/base. These two regions correspond to those of the two Blast hits. The third type of graph is illustrated with Contig 4,175, which does not show a real difference in coverage between the two aligned regions nor a drop located between Blast hits. This last type of graph is the most frequent and is similar to the one obtained from the highly reliable contig comp1694_c0_seq 1.

Fig. 4
figure 4

Graphical coverage representation. 13 coverage of 3 potential chimeric contigs, 4 highly reliable contig. The regions with a significant reciprocal Blast hit to the R. prolixus reference genome are bounded by vertical lines and indicated by arrows. Coverage of bases is shown in grey. The black line represents the average coverage over 100 base pair windows with a 50 bp overlap

SNP

The population diversity was evaluated estimating the SNP number of all samples pooled and it was compared to the individual diversity estimated using the SFF sample. Mapping of single-end reads is not constrained in the same way as mapping of paired-end reads which could distort the comparison. We therefore used only forward reads of the SFF sample for SNP identification. SFF forward reads and pool of all population samples present quite similar coverage (46,051,230 vs. 74,386,931 reads). We used a quality filter of 20 for SNP identification and found 53,991 SNPs for SFF and 66,279 for all samples pooled representing one SNP every 1,189 and 969 bp, respectively. As expected, the number of SNP was lower for SFF (a single diploid individual) than for the pool data (mix of 63 diploid individuals). However, the polymorphism level of the single SFF appears quite high compared to the population sample.

Characteristics of the best assembly

Combining the different metrics presented, the assembly showing the best compromise between contig distribution, completeness and percentage of chimeric contigs was 10-c. In order to estimate the number of the coding genes contained in the best assembly consisting of 57,657 contigs, we performed a Blast against R. prolixus peptide dataset. Among the 18,642 T. brasiliensis contigs with a peptide blast hit, 8,125 peptides were retrieved (43.6 %). Assuming this non-redundant transcript rate as representative of the complete dataset, the estimated number of the coding genes is about 25,000 in this assembly.

GO annotations for assembly-10c are presented in Fig. 5. Among the 57,657 contigs, 16,612 (28,8 %) gave significant blast hit, 11,378 (19,7 %) were aligned with a GO term and finally 7,117 (12,3 %) were annotated. The biological process was composed of a majority of single-organism process (15 %), metabolic process (14 %), and cellular process (19 %) transcripts. The cellular component was predominantly represented by organelle (28 %) and cell (39 %). Binding and catalytic activities represented the major molecular functions (48 and 36 % respectively). One contig was annotated as virion cellular component. Similar results were found for assembly 10a (Supplementary data S2).

Fig. 5
figure 5

Distribution of T. brasiliensis contigs from assembly 10c annotated at GO level 2. (Color figure online)

Discussion

In a comparative study of non-model organism genomics, the conclusions drawn from the analysis of HTS data partly depend on the quality of the assembly (Stapley et al. 2010; Vijay et al. 2013). Major information may be missed out from an incomplete assembly. Moreover, transcripts that are improperly assembled and/or scattered across different contigs may complicate the read mapping and skew differential expression results. Thus, it is fundamentally important to construct the most optimized assembly from an available dataset. In order to conduct a comparative study of the different T. brasiliensis populations in the future, the first step was to assemble the transcriptome of this species. In this article, we tested and compared several assembly methods from different datasets. The quality of these assemblies was assessed according to numerous criteria: number of transcripts, transcript size, completeness, and chimeric contig occurrence based on the predicted proteome of R. prolixus.

Influence of assembly and workflow software

Concerning 454 data assemblies, Newbler generated longer contigs than MIRA (N50 of 662 vs. 491 bp). Cahais et al. (2012) found similar results when assembling 454 reads from Ciona intestinalis, as did Ren et al. (2012) on simulated data and 454 reads from Trychophyton rubrum. Newbler is more stringent in filtering contig size. The size distribution appeared therefore moved toward longer contig size (increased N50) but this resulted mainly from the elimination of small contigs as well as the genetic information they included (Cahais et al. 2012). This is typically in line with what we observed in assembly 1. Our results suggest that, despite being longer, Newbler contigs were less informative than those generated using MIRA (CEGMA partial 37 vs. 43.55 %), in agreement with the results of Ren et al. (2012) and Cahais et al. (2012). However, in this latter study, the highest N50 and contig number aligned to the reference were found using CAP3. In our study, CAP3 produced similar results to those of MIRA. Several software programs are also available to assemble Illumina data. According to Zhao et al. (2011), the assembler Trans-Abyss is a good compromise between RAM need, runtime and assembly quality. Cahais et al. (2012) also endorse this assembly software. Oases is another assembler frequently used (Feldmeyer et al. 2011; Francis et al. 2013; Haznedaroglu et al. 2012; Zhao et al. 2011). We tested Oases and Trans-Abyss to assemble T. brasiliensis Illumina reads with different k-mer sizes and default options. The assemblies generated (data not shown) were either similar or less informative than those obtained with Trinity. Therefore, we decided to use only this latter software.

Completeness and annotation of the transcriptome

The expected distribution of transcript number and length may be estimated from the reference genome of R. prolixus, which has been recently sequenced and assembled (www.vectorbase.org). The predicted transcriptome of this species comprises 17,155 genes with a mean transcript length of 1,016.5 bp and a maximum length of 34,028 bp. Therefore, we expected to find the same orders of magnitude for T. brasiliensis transcripts. All transcriptome assemblies (except assembly 1) had a contig number and a total number of bases higher than that of R. prolixus predicted gene set defined from the genome. Transcriptome assemblies from other Hemiptera species also exhibit a large number of contigs. For example, assembly of transcriptome data comprised 54,748 contigs for Trialeurodes vaporariorum with an average length of 965 bp (Karatolos et al. 2011) and 35,646 expressed sequence tags with an average contig length of 759 bp in the bed bug Cimex lectularius (Bai et al. 2011). In more distant arthropod groups, 51,999 contigs were assembled for Sesamia nonagrioides (Glaser et al. 2013), 128,391 for Theridion grallator and 104,481 for Theridion californucum (Croucher et al. 2013). The discrepancies between transcript and gene numbers could be attributed to alternative splicing events producing several different transcripts from a single gene and increasing the transcriptome size. Such a result was observed by Hull et al. (2013) who found 36,131 transcript isoforms corresponding to 19,742 genes in western tarnished plant bug Lygus hesperus. However, the estimated gene number of T. brasiliensis transcriptome (about 25,000) was still higher than the one of R. prolixus. The gene content between the two bug species may be somewhat different as they are quite distant: genetic duplications specific to T. brasiliensis may have occured followed by gain of function. The difference may also result from an underestimation of the gene content of the recently sequenced genome of R. prolixus. A CEGMA test we performed on the predicted transcriptome of R. prolixus gave a result of 14.11 % complete proteins and 17.34 % partial proteins. This suggests that the putative gene set of R. prolixus is still incomplete.

According to CEGMA, assemblies from 454 (assemblies 1–5, Table 2) and those from Illumina single-reads of populations (assemblies 6-a1–6-a14, Table 1) gave very poorly informative results. Overall, the CEGMA increased with the number of initial sample reads. The transcriptome built from pool of single reads from all population samples (assembly 6-a) has a high completeness rate (CEGMA: 83.47 % partial and 60.08 % complete), but one lower than the assembly constructed from a single sample (SFF), while the sequencing depth of this latter sample is lower. According to Francis et al. (2013) the transcriptome completeness using CEGMA values increases with the sequencing depth, but decreases beyond a threshold (estimated between 20 and 30 million reads). An excess of reads may complicate assembly and lead to errors due to the occurrence of misassembled or chimeric cases. However, transcriptomes of spiders Theridion grallor and Theridion californicum were assembled from about 250 million Illumina reads each and showed CEGMA results of 99 and 98 % complete, respectively (Croucher et al. 2013). It is more likely that appropriate read number to maximize the assembly completeness is variable and intrinsic to each organism depending on the complexity of the genome, such as the number of transcribed genes, their average size and the gene expression level (Zhao et al. 2011). Moreover, the completeness of the transcriptome also varies depending on the tissues targeted for the extraction. Francis et al. (2013) found a higher percentage of conserved eukaryotic orthologous genes using a transcriptome built from a complete individual RNA than from a single tissue (90 vs. 63–81 %), reflecting that genes may not always be expressed at a sufficient level in all tissues. Our population samples were derived from antennae and mouthparts of bugs, while brain tissue was also targeted for the SFF sample in order to obtain enough RNA material from a single individual. This addition of brain tissue may have improved the completeness of the transcriptome.

For this SFF sample, two other parameters could also explain its CEGMA value. First, this sample comprises a single individual, which decreases the polymorphism rate and facilitates assembly. Indeed, SNP identification reveals a higher polymorphism in the pooled population sample than in the individual sample. Second, the paired-end technology used for this assembly has a positive effect on the completeness of the transcriptome as revealed by the comparison with the single-end assembly for this sample. According to Sengupta et al. (2011), paired-end technology makes it possible to detect alternative splice junctions, insertions and deletions, and facilitates the de novo assembly of the transcriptome. However, single-end sequencing is still very widely used because of its lower cost.

Despite differences in completeness and polymorphism, the type of samples did not change the GO annotation. The proportion of biological process, cellular composition and molecular function annotations are similar between assemblies 10-a and 10-c and comparable to those found for other chemosensory transcriptomes of insects, such as Sodoptera littoralis (Poivet et al. 2013) and S. nonagrioides (Glaser et al. 2013). Equivalent repartitions were also found in transcriptomes from guts or full-body of Aphis glycines (Liu et al. 2012).

Chimeric contigs

Among the assemblies tested, the characteristics of assembly 10-c appeared to offer the best compromise, taking into account the contig numbers, the N50 and the CEGMA values. This assembly generated contigs up to 47,145 bp, a number compatible with the largest predicted transcript for R. prolixus (34,028 bp, www.vectorbase.org), but the occurrence of chimeric contigs could not be excluded. Overall, the number of potential chimeric contigs increased with the complexity of the assembly workflows (assemblies 7–10), but also in relation to the number of available annotated genes. For example, assemblies of 454 EST simulated from the human genome generated with Newbler, CAP3 or MIRA gave high percentages of chimeric contigs (14 % with MIRA, 15 % with CAP3 and 38 % with Newbler; Mundry et al. 2012). However, in the Mundry et al. (2012) study, assemblies and reference are issued from the same genome, which enhances the power of chimera detection. In our case, the reference genome is quite distant, which should reduce the percentage of chimeric contigs detected. Conversely, the assembly and annotation of the R. prolixus genome is unfinished, which could artificially increase the percentage of false chimeric contigs. Owing to both observations, the number of potential chimera detected (up to 1 %) seems satisfactory.

The occurrence of chimeric contigs had been examined by studying their coverage. For some of the chimeric contigs, the significant drop in coverage could be interpreted as the boundary between two different transcripts assembled in a single contig. For some others, two regions were differentiated by distinct coverage that may be a consequence of two differentially expressed transcripts being erroneously assembled. It could be also the result of alternative splicing with the presence of an exon over-expressed compared to others because it is included in several isoforms. Even in this case where the chimera nature of the contig is not established, it may be preferable to separate both exons to better estimate differential expression level. That is why, in these two coverage profiles, chimeric contigs were cut in order to generate two unlinked transcripts. However, in most cases, the coverage of potentially chimeric contigs was very heterogeneous with no clear boundary. Such heterogeneity was also found in contigs aligned on only one reference gene with a high reliability, which questions the chimeric nature of this contig class. Moreover, in some rare cases, the fusion of mRNA can be generated during trans-splicing events (Bonen 1993). Such fusion would be interpreted as chimeric contigs in our protocol while it corresponds to a true transcript. Consequently, the percentage of potential chimeric contigs is probably overestimated.

Conclusion

Our study documents how a high quality assembly results from various parameters including the sampling procedure, the sequencing strategy and the assembly workflow. Despite it is tempting to perform assembly on population pool with the aim to better appreciate global genetic information, we do not recommend such an approach. Instead, we showed that the use of a single individual with large representation of biological tissues is preferable to improve both the total number of bases assembled and the completeness of the assembly. In this case, the assembly procedure is facilitated by decreasing polymorphism rate and the representation of the whole transcriptome is improved by the diversity of tissues sampled. Sequencing technologies now require small amount of RNA allowing to sequence only one individual even for species or tissues of small size. In addition to the choice of biological material, we highly recommend a mixed sequencing strategy merging both long reads as generated by 454 or the recent MiSeq Illumina technology (that allows up 2 × 250pb fragments in paired-end reads) and short but paired-end Illumina reads (2 × 100 pb fragments). We showed that the well-documented biases of 454 technology (such as homopolymers) may be corrected using Illumina sequences and that utilization of both technologies greatly improves length distribution of contigs generated. The size distribution is also increased by the usage of paired-end rather than single-end technology. Finally, the usage of various assemblers is also a powerful approach combining the specific advantages of each as far as duplicate contigs are eliminated in the final step.