De novo transcriptome assembly for a non-model species, the blood-sucking bug Triatoma brasiliensis, a vector of Chagas disease

Marchant, A.; Mougel, F.; Almeida, C.; Jacquin-Joly, E.; Costa, J.; Harry, M.

doi:10.1007/s10709-014-9790-5

De novo transcriptome assembly for a non-model species, the blood-sucking bug Triatoma brasiliensis, a vector of Chagas disease

Published: 19 September 2014

Volume 143, pages 225–239, (2015)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Genetica Aims and scope Submit manuscript

De novo transcriptome assembly for a non-model species, the blood-sucking bug Triatoma brasiliensis, a vector of Chagas disease

Download PDF

A. Marchant^1,2,
F. Mougel^1,2,
C. Almeida³,
E. Jacquin-Joly⁴,
J. Costa⁵ &
…
M. Harry^1,2

972 Accesses
20 Citations
1 Altmetric
Explore all metrics

Abstract

High throughput sequencing (HTS) provides new research opportunities for work on non-model organisms, such as differential expression studies between populations exposed to different environmental conditions. However, such transcriptomic studies first require the production of a reference assembly. The choice of sampling procedure, sequencing strategy and assembly workflow is crucial. To develop a reliable reference transcriptome for Triatoma brasiliensis, the major Chagas disease vector in Northeastern Brazil, different de novo assembly protocols were generated using various datasets and software. Both 454 and Illumina sequencing technologies were applied on RNA extracted from antennae and mouthparts from single or pooled individuals. The 454 library yielded 278 Mb. Fifteen Illumina libraries were constructed and yielded nearly 360 million RNA-seq single reads and 46 million RNA-seq paired-end reads for nearly 45 Gb. For the 454 reads, we used three assemblers, Newbler, CAP3 and/or MIRA and for the Illumina reads, the Trinity assembler. Ten assembly workflows were compared using these programs separately or in combination. To compare the assemblies obtained, quantitative and qualitative criteria were used, including contig length, N50, contig number and the percentage of chimeric contigs. Completeness of the assemblies was estimated using the CEGMA pipeline. The best assembly (57,657 contigs, completeness of 80 %, <1 % chimeric contigs) was a hybrid assembly leading to recommend the use of (1) a single individual with large representation of biological tissues, (2) merging both long reads and short paired-end Illumina reads, (3) several assemblers in order to combine the specific advantages of each.

Hybrid de novo whole-genome assembly and annotation of the model tapeworm Hymenolepis diminuta

Article Open access 03 December 2019

Improved annotation with de novo transcriptome assembly in four social amoeba species

Article Open access 31 January 2017

Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation

Article Open access 18 November 2015

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

HTS has opened considerable opportunities in the field of genomics for the study of a number of organisms (Martin and Wang 2011; Stapley et al. 2010; Werner 2010). Through massive sequencing approaches, differential expression studies, candidate gene searches, gene ontology annotation, molecular marker development, alternative splicing, genetic variation characterization and SNP development can now be done in non-model species (Ekblom and Galindo 2011). Such information could be of interest for insect vectors in order to improve control strategies. However, comparative studies first require a reference assembly that can either be a genome or a transcriptome.

Concerning the bloodsucking bugs transmitting to human the parasite Trypanosoma cruzi, responsible of the Chagas disease, endemic to Latin America, only one genome that of Rhodnius prolixus, was recently sequenced and assembled (https://www.vectorbase.org/). R. prolixus was the main vector of Chagas disease in Northern Latin America and Central America before its recent eradication (Hashimoto and Schofield 2012). But, in Brazil, since the recently eradication of the main vector, Triatoma infestans, by massive campaigns of chemical control (Dias et al. 2002), health authorities are now concerned about Triatoma brasiliensis, another vector increasingly reported in houses and peridomestic environment (Almeida et al. 2008; Borges et al. 2005; Costa 1999; Costa et al. 1998, 2003). Comparative transcriptomic study based on the use of the newly available high throughput sequencing (HTS) technologies would allow deciphering the molecular bases of the domiciliation process.

Generating such a reference from a non-model species can be complex since assembling de novo the considerable amount of reads generated from HTS is a difficult puzzle. Assembling transcriptomic rather than genomic data may appear easier, facilitating further differential expression studies, as non-coding sequences are absent and the amount of information reduced (Paszkiewicz and Studholme 2010). Nevertheless, alternative splicing, gene duplication or intrinsic polymorphism of biological sample may complicate transcriptome assembly (Sammeth 2009). HTS has different characteristics depending on the technology, which may affect the assembly process. Among the conventional techniques, the substantial length (approximately 342 bp) of 454 reads facilitates contig assembly, but this technique also generates a high frequency of homopolymer repetition errors. Furthermore, read coverage of transcripts remains low with this technique because of the small amount of reads produced: typically around hundreds of thousands of reads (Ekblom and Galindo 2011). As an alternative, Illumina technology sequencing, which may be done on single or paired-end reads, produces a very large amount of reads (typically a billion), generating a high coverage depth. However, the small size of the sequences (about 100 bp) complicates the assembly. Both technologies can be exploited for their respective advantages by combining long reads from 454 with high coverage depth from Illumina data (Martin and Wang 2011).

The choice of assembly software can be complex. Indeed, with the development of HTS, the publication of various de novo transcriptome assemblers, based on more or less similar approaches, has boomed. For example, many software programs are available for long reads typically obtained with 454 technology, including MIRA (Chevreux et al. 2004), CAP3 (Huang and Madan 1999), Newbler (Knudsen et al. 2007), NGen (http://www.dnastar.com/t-nextgen-seqman-ngen.aspx), or CLCbio (http://www.clcbio.com). Among these, three programs are highly recommended: MIRA, which gives very reliable output according to Mundry et al. (2012) and Ren et al. (2012); CAP3 that was suggested as a robust assembling software by Cahais et al. (2012); and the commercial program Newbler, often furnished by the sequencing platform used and recommended by Ren et al. (2012). MIRA is the only program that combines the iterative approach and the all-in-one-step approach. The first step of iterative approach produces an assembly with very stringent parameters (particularly with regard to the quality criteria). Following steps continue with less stringent criteria until the assembly is finished. The all-in-one-step approach starts with all data whatever its quality and generates a consensus using best quality bases as reference. This way, MIRA uses iterative multipass strategies for reads targeted as high or acceptable quality for the assembly. At this step, it detects SNP during the assembly and sorts them by type to reduce assembly mistakes. In a second stage, a fallback strategy uses reads with lower quality if necessary (Chevreux et al. 2004). CAP3 is based on a multiple alignment strategy, taking into account the base quality (Huang and Madan 1999). Newbler starts by identifying overlaps between reads and next builds multiple alignments from overlapping reads to generate preliminary contigs. The final step consists in creating a contig graph to connect contigs together from the edges (Knudsen et al. 2007). To assemble short reads typically produced by Illumina technology, various free software are available, among which some are specifically devoted to transcriptomic data: Trans-Abyss (Robertson et al. 2010), Trinity (Grabherr et al. 2011), Oases (Schulz et al. 2012), Rnnotator (Martin et al. 2010), Multiple-K (Surget-Groba and Montoya-Burgos 2010) or SOAPdenovo-Trans (Xie et al. 2013). Trinity, which was especially designed for transcriptome assembly and offers several advantages: it can solve 99 % of the initial sequencing errors, detects isoforms from the alternative splicing or transcripts recently duplicated and identifies allelic variants while building complete transcripts (Grabherr et al. 2011).

To determine the most appropriate assemblers, various comparative studies have been conducted on biological datasets or on simulated coding DNA sequences (Cahais et al. 2012; Mundry et al. 2012; Ren et al. 2012; Vijay et al. 2013; Zhao et al. 2011). Comparisons were performed based on three groups of criteria. The “computer” criteria include the amount of RAM required and the run time. The quantitative criteria comprise the number of contigs, the total number of bases assembled, the average length of contigs, the N50 (length for which the collection of all contigs of that length or longer contains at least half of the total of the lengths of the contigs), the number of contigs higher than 1 kb and the number of mapped reads on the final assembly. The qualitative criteria include the completeness of the transcriptome and the presence of chimeric contigs, both parameters being obtained by comparison with a reference genome. The programs considered as the most effective differ between comparative studies depending on datasets and on the weight attributed to different evaluation criteria. Based on a simulated dataset, Vijay et al. (2013) recommended SOAPdenovo, which provides non chimeric, but fragmented, transcriptome, especially when there are a large number of reads and high polymorphism in the dataset (Vijay et al. 2013; Zhao et al. 2011). Among assemblers based on the single-k-mer strategy, Zhao et al. (2011) identified Trinity as the best assembler for Illumina datasets because of the greater number of contigs in the assemblies that can be aligned to the reference genome. For 454 reads from transcriptomes, Ren et al. (2012) recommended MIRA, while Mundry et al. (2012) recommended Newbler and MIRA both showing contrasting specificities. MIRA generated a fragmented and redundant assembly while Newbler built a more complete restoration of full-length transcripts, although it contained large chimeric contigs. Cahais et al. (2012) developed a more complex workflow including data from 454 and Illumina using Abyss and CAP3 assemblers.

All these comparison studies are based on data generated from well-known species. By contrast, without a reference genome, the number of genes and alternative splicing are unknown. Therefore, it is difficult to estimate the expected number of transcripts, the completeness, quality and validity of the assembly or to detect chimeric contigs (Cahais et al. 2012). Definition of assessment methods should help in the choice of assembling strategy. Furthermore, all studies, except Cahais et al. (2012) study, compared assemblies generated from a single sequencing source (Illumina or 454) while both sources have complementary properties that could be combined.

To date, no massive genomic and transcriptomic data are available for T. brasiliensis except the production of a salivary gland cDNA library (Santos et al. 2007). The first step of such an approach is then to produce a reliable reference for T. brasiliensis. In this study, we applied HTS to RNA extracted from antennae, rostrum and head of T. brasiliensis males and females collected in various environments: sylvatic, domiciliary or peri-domiciliary, to generate an optimized chemosensory transcriptome. The generated assembly is to be used as a reference for further comparative transcriptomics. It is known that chemosensation allows bugs to interact with their environment and influences their behavior, thus participating in insect adaptation to new environments and new host choice, playing an important role in anthropogenic adaptation. Studying the expression of the chemosensory transcriptomes from various populations of Triatominae would be an innovative way to address the issue of domiciliation.

For the purpose of getting the best assembly, several types of data and assemblers were tested and compared. Two sequencing technologies were used (454 and Illumina in single or paired-end) with samples from a single individual or pooled individuals. Ten assembly workflows were compared and characterized according to a set of qualitative and quantitative criteria. The hybrid assembly using Illumina paired-end and 454 corrected data in hierarchical workflow using several assemblers was found to optimize both the number of contigs, contig size and the completeness of transcriptome.

Materials and methods

Sampling

Samples of T. brasiliensis were collected in March 2011, in the Caatinga biogeographic zone of the municipality of Caico in the north of Brazil. Five populations were collected in sylvatic, domestic and peridomestic habitats (Table 1).

Table 1 T. brasiliensis samples from which transcriptomes were sequenced and characteristics of assemblies generated with Trinity

Full size table

RNA extraction and sequencing

Several individuals were pooled by sex and population, except for the SFF sample which consisted of one sylvatic female (Table 1). To target genes involved in the chemosensory system, RNA was extracted from the antennae and rostrum of blood-sucking bugs using the TRIzol^® Reagent kit (Invitrogen, Carlsbad, CA, USA). Aliquots of all extraction samples were pooled and the mix was used as template for the construction of a concatamerized, normalized cDNA library prepared and sequenced using 454 Roche GS FLX Titanium (½ Pico Titer Plate, Tbra454, Table 1) by LGC Genomics GmbH (Berlin, Germany). In addition, fourteen libraries were constructed using TruSeq RNA Kit 2010 from Illumina for each RNA sample independently and sequenced with Illumina HiSeq 2000 method (http://support.illumina.com/) in single-end reads of 100 bp by LGC Genomics plateform. One library (SFF sample) was prepared using ScriptSeq™ Complete Kit (Human/Mouse/Rat)—Low Input of Epicentre (http://www.epibio.com/) and sequenced by Imagif (CNRS, Gif-sur-Yvette, France) in paired-end reads (2 × 100 pb) with Illumina Hiseq 1000, using TruSeq PE Cluster Kit v3-cBot-HS and TruSeq SBS Kit v3-HS kits.

Preprocessing

lllumina single-end reads were cleaned with Cutadapt (Martin 2011) and Prinseq (Schmieder and Edwards 2011a). Two sets of thresholds were applied (the gentle “cleaned-1” and very stringent “cleaned-2”) resulting in the deletion of: (1) indefinite bases, (2) the poly A tails (using parameters trim-tail-left 5 for both cleanings; trim-tail-right 5 for cleaned-1 and trim-tail-right 8 for cleaned-2), (3) the small sequences (minimum size of 20 for cleaned-1, 40 for cleaned-2), (4) exact duplicated sequences resulting from PCR (for both cleanings), (5), the low quality ends of the sequences (minimum grade of 20 for cleaned-1, and 28 for cleaned-2), and (6) the low complexity sequences (both cleanings). The cleaned-1 protocol was also applied for 454 reads and Illumina paired-end reads. For samples sequenced in paired-end, reads that became singlets after cleaning were set aside before assembly.

The quality of the sequences was analyzed using FastQC (Supplementary data S1; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/). Part of possible contamination of cleaned-1 reads was evaluated for each single-read sample using Deconseq version 0.4.1 (Schmieder and Edwards 2011b). We used several databases: Human—Reference GRCh37, Human—Celera Genomics, Human—Craig Venter (HuRef), Human—Seong-Jin Kim (Korean), Human—Chromosome 7 version 2 (TCAG), Human—unique from Reference in James Watson, YanHuang (YH; Asian), Yoruba (NA18507; African), Mus musculus C57BL/6 J build 37, Danio rerio Tuebingen version Zv7, 18S sequences (57,317 unique sequences, 08/11/10), viral genomes in RefSeq 45 (3,761 unique sequences, 02/12/11), bacterial genomes (2,206 unique genomes, 02/12/11), archaeal genomes (155 unique genomes, 02/12/11), Salmonella enterica genomes (52 strains, 12/16/10) and bacterial genomes HMP (76,337 WGS sequences, 02/12/11) databases.

Assembly tools

Several assembly programs were used and compared (Fig. 1). 454 reads were assembled using three different assemblers: (1) Newbler, (2) MIRA with the skim_each_pass assembly option, keep_percent_memory_free at 100 and merge_xmltraceinfo (mxti) = no loadreads option (Chevreux et al. 2004), and (3) CAP3 with default settings (Huang and Madan 1999). Illumina reads were assembled using Trinity in all strategies with k-mer_method jellyfish options with 40 GB of system memory, on 4 threads (Marçais and Kingsford 2011; Grabherr et al. 2011). The longest run for the largest dataset took 2 weeks. Finally, MIRA and CAP3 were also used to assemble a merged dataset: raw 454 reads, assembled 454 contigs and Illumina contigs. Assemblies performed on raw data took into account quality information.

Workflow

In a first step, 454 and Illumina data were processed separately (assemblies 1–6; Fig. 1). 454 reads were first assembled with Newbler (assembly 1), MIRA (assembly 2) and CAP3 (assembly 3). Contigs and singletons from assemblies 1 and 2 or from all three assemblies (1, 2 and 3) were pooled and reassembled with CAP3 (assemblies 4 and 5 respectively). Illumina reads were assembled using Trinity (assembly 6). In a second step, the Illumina data were merged with the 454 data to increase contig size (assemblies 7–10; Fig. 1). In the first trial, 454 raw reads were pooled with Illumina contigs from Trinity and used as input to MIRA (assembly 7) or CAP3 (assembly 8). In a second trial, built 454 contigs from assembly 4 were pooled with Illumina contigs from Trinity and reassembled with a second CAP3 (assemblies 9 and 10). This last step led to the building of “super-contigs” which were selected when their length exceeded 200 bp using the –min_len option of Prinseq. Furthermore, some contigs generated from the first steps of 454 and Illumina assemblies were not reassembled in the final CAP3 run and were called contig-singlets. Some of them showed significant length and were therefore conserved (with a threshold of 700 bp). In these two last assemblies (9 and 10), potential duplicates were eliminated with CD-HIT-EST (Li and Godzik 2006).

Illumina data compilation for assembly

Illumina data were compiled in various ways depending on biological samples used (single individual, single population or pooled populations), sequencing approach (single-end or paired-end) or cleaning procedure. Four main ways were used for subsequent assembly: (1) cleaned-1 Illumina single-end read sequences for each population (assemblies 6-a1–6-a14, Table 1); (2) cleaned-1 Illumina paired-end read sequences for a single individual (SFF) (assemblies 6c–10c, Tables 1 and 2); (3) cleaned-1 Illumina single-end reads from all samples pooled (assemblies 6a–10a, Tables 1 and 2) and (4) cleaned-2 llumina single-end reads from all samples pooled (assemblies 6b–10b, Tables 1 and 2). Single Trinity assembly of raw Illumina data was processed for these four types of data (assemblies 6a1–6a14, 6a/b/c). More complex workflows (assemblies 7–10) were only performed on data types (2) (3) and (4). To distinguish between the effect of paired-end sequencing and effect of using a single individual, assembly 6 was also conducted on data source (2) using only forward reads (assembly 6c’, Table 1).

Table 2 Characteristics of the different assemblies

Full size table

Correction of 454

454 sequencing is known to generate homopolymer tracts. Consequently, the contigs from the 454 assemblies were corrected by aligning Illumina reads on the 454 contigs using BWA (with default settings; Li and Durbin 2009). Polymorphism between 454 contigs and the mapped Illumina reads was collected with the options ‘faids’ and ‘mpileud’ of SAMtools (using default settings; Li et al. 2009). Contigs without polymorphism with mapped Illumina reads or contigs without mapped reads were not changed. 454 contigs showing polymorphism by comparison with Illumina reads were filtered with the bcftools option and corrected: substitutions, duplications or deletions were modified following Illumina read sequences. This step was repeated five times. The correction model was applied in assembly 10.

Indices of assembly quality

Several criteria were used to evaluate the quality of the assemblies. First indices were contig lengths, including N50, total number of bases and contig number. The number of contigs was compared to the 17,155 transcripts estimated from the genome of R. prolixus. The completeness of the transcriptome was also estimated by the presence of sequences belonging to the set of ultra-conserved eukaryotic proteins, tested using the CEGMA pipeline (Core Mapping Approach Eukaryotic Genes; Parra et al. 2007). In addition, we created a script to check the presence of chimeric contigs by reciprocal Blast from the predicted proteome of R. prolixus. We selected contigs and proteins that are best aligned together in both directions and with a Blast e-value smaller than 10⁻¹⁰. Contigs aligned with two different proteins in two loci without significant overlap were considered as potentially chimeric (Fig. 2).

Global coverage was measured for all assemblies. The RMBT (reads mapped back to transcripts) were evaluated using BWA (Li and Durbin 2009) and SAMtools (Li et al. 2009).

General annotation

We processed and compared the functional annotation of assemblies 10a and 10c with Blast2GO (Conesa 2005). The blast annotation part was processed in command line using Swiss-Prot database with an E-value threshold of 10⁻³ and an xml output format.

Chimeric contig validation

In order to resolve the chimeric contig problems of assembly 10-c, we studied the coverage of these contigs looking for atypical coverage profiles that could provide a clue about the misassembly location. To be sure that the atypical coverage profiles observed are specific to chimeric contigs, we compared them to the coverage of highly reliable contigs. Highly reliable contigs must have a unique alignment (with a single gene predicted from R. prolixus genome) of at least 700 bp over 80 % of their total length and an identity rate greater than 60 %. Illumina reads used for assembly 10-c were aligned with both contig categories (chimeric and reliable) with BWA. The coverage of each base was calculated using Bedtools (Quinlan and Hall 2010) and the variation along the contig was drawn with R (R Development Core Team 2008).

SNP

To compare polymorphism level of the samples used in assemblies, we checked for SNP. For this purpose, cleaned-1 reads from (1) Illumina single-end from all samples pooled and (2) Illumina single-end from a single individual (SFF) were mapped to the assembly 10c with BWA. SNP were detected from the 2 bam files generated using pileup function of SAMtools with −C 50 option to downgrade mapping quality for reads containing excessive mismatches. Quality filter of 20 was then applied on the SNP detected with VCFtools (Danecek et al. 2011).

Results

Contig number, total length and RMBT

The 454 library yielded 555,854 reads (Table 1). The fifteen Illumina libraries yielded nearly 360 million single reads and 46 million paired-end reads for nearly 41 Gb of sequences. Analysis with Deconseq revealed that probable contaminant reads varied between 0.78 and 6.85 % (Table 1). Contaminant reads mainly originated from microbes that could be biologically associated with bugs. As these microbes could also be involved in the adaptation process, we maintained such reads in dataset used for assemblies.

The number of contigs varied according to assemblies (Table 2). Assemblies built only with Illumina data (assemblies 6-a, 6-b and 6-c) comprised hundred thousand contigs (256,799; 136,755 and 365,717 contigs, respectively) whereas assemblies built only with 454 reads (assemblies 1–5) comprised 2–10 times less contigs (between 15,825 and 69,917 contigs). Among the 454 assemblies, Newbler assembly (assembly 1) had the lowest number of contigs (15,825) compared with the other 454 assemblies (58,271–69,917 contigs). Combining raw 454 data and assembled Illumina contigs (assemblies 7-a/b/c and 8-a/b/c) led to an intermediate number of contigs (between 56,851 and 92,040). Finally, assemblies merging Illumina and 454 contig data (9-a/b/c and 10-a/b/c) were composed of only tens of thousands of contigs (24,901–57,657). Preprocessing of Illumina reads also influenced the final number of contigs: the more stringent process (cleaned-2) generated less contigs (b assemblies) than the less stringent process (cleaned-1, a assemblies), whatever the assembly protocol. The total number of bases was similar between assemblies except for those obtained by Newbler (assembly 1) and Trinity (assembly 6), which were the two extremes, with 9.8 million and 179 million bases, respectively.

Among the Trinity assemblies generated from a single population data (assemblies 6-a1–6-a14; Table 1), assembly 6-a11 (SFC1) presented much higher number of contigs (66,830) and total number of bases (about 23 million bases) than other population samples. Inversely, assembly 6-a6 (PMD) comprised a low number of contigs (3544) and had a small total length (about 1 million of bases) probably due to the small amount of RNA extracted (Table 1). Technical replicates showed contrasted results (Table 1), the highest discrepancy being observed between SFC1 and SFC2 assemblies (66,830 contigs and 22,901,564 bp; 10,682 contigs and 3,041,341 bp, respectively). Generally, the highest number of post-cleaned reads led to the longest assembly and may explain this discrepancy although the reverse was observed for DMB1 and DMB2. Furthermore, low sequencing depth for single population samples may explain the lack of convergence between replicates. The transcriptome from the pool of Illumina single reads for all populations presents a high number of contigs with very short lengths. A stricter cleaning of the sequences did not necessarily improve the result (assembly 6-a vs. 6-b, Table 2). Overall, the total number of bases was directly related to the total number of cleaned reads assembled. By contrast, the Illumina paired sample SFF assembled with Trinity presented more contigs than the assembly generated from pooled population samples (365,717 compared to 256,799; Table 1), despite a smaller number of reads (15,762,629 R1 + 15,762,629 R2 vs. 74,386,931). This also held true when SFF was assembled as single-reads. Improvement of assembly length observed from SFF data may result from (1) the lower heterozygosity of the sample as a single individual was sequenced, (2) the type of tissue sequenced, as brain was added to antennae and rostrum for RNA extraction and also (3) from paired-end technology.

The RMBT is indicative of the part of the raw data that was used to build the assembly and therefore partly follows the total number of bases. For example, assembly 1 which showed the smallest number of bases assembled also showed the smallest RMBT (18.09 %; Table 2). On the other hand, assembly 6c’ had the highest RMBT (71.09 %) but Assembly 6c with the highest total number of bases showed a moderate RBMT (53.65 %). This resulted from the highest constraint in mapping paired-end read as mapping of associated reads had to be concordant. Except for these 3 assemblies, RMBT roughly ranged between 30 and 40 % including assembly 10c (39.6 %), which showed the best RBMT among assemblies combining 454 and Illumina data (assemblies 7–10), despite the constraint mentioned above in mapping paired-end reads.

N50 and length distribution

Length distribution of contigs from assemblies 1, 2, 4, 6-a/c, 8-a/c, 10-a/c is shown in Fig. 3 and metrics for all assemblies are given in Tables 1 and 2. N50 and length distribution differed between data sources (454; Illumina single or Illumina paired) and assembly processes. Assemblies 2–5 of the 454 reads presented similar distributions of contig length (Fig. 3) with a N50 around 500 bp (Table 2). N50 was higher for assembly 1 (662 bp) than other 454-based assemblies. In assemblies of Illumina single-end reads 6-a and 6-b, the most frequent class was between 200 and 300 bp, in agreement with N50 results (313 bp; Table 2). Assemblies of Illumina single-end reads from each population (assemblies 6-a1–6-a14; Table 1) or from the pool of all populations (6-a and 6-b, Table 2) presented similar N50 (about 300 bp). Assemblies performed with pooled Illumina contigs from single-end reads and 454 reads (7-a/b and 8-a/b) presented N50 and length distribution similar to 454 read assemblies alone (1–5). The protocol merging Illumina contigs from single-end reads and 454 contigs (9-a/b and 10-a/b) produced longer contigs (N50 from 917 to 919). Contig length distribution of these last assemblies appeared bimodal (Fig. 3) due to the addition of contig-singlets longer than 700 bp (Fig. 1). When similar workflows are compared, N50 were higher for assemblies generated from one individual (SFF sample) than for those from a pool of populations (6-c to 10-c vs. 6-a to 10-a and 6-b–10-b Table 2; 6-c and 6-c′ vs. 6-a and 6-b, Table 1).

Annotation and completeness

A sequence identified as coding for one of the 248 proteins of the set of ultra-conserved eukaryotic proteins and with alignment length higher than a pre-computed minimum alignment score was named “partial”. If the alignment length was higher than 70 % of the protein length it is called “complete”. The set of “partial” matches necessarily includes the set of “complete” matches. Sample SFC1 differed from other population samples in the CEGMA result (assembly 6-a11; Table 1). It presented 56 % of the protein set whereas in other samples only 1.21–24 % of the protein set were counted. The completeness was mainly related to the total number of bases in the assembly and therefore highly variable from one sample to the other even between replicates. Sample SFF from a single individual presented a CEGMA that was clearly higher (89.52 %) than other samples from the pool of individuals from the same population. The best result was found for assembly 6-c: 93.95 % of the set of eukaryotic conserved proteins were found and 81.45 % of sequences were complete (Table 2). This improvement may result from the better quality of the assembly (low heterozygosity and paired-end data) but also from the biological representativeness of the sample: in the SFF sample, the whole head was used for RNA extraction instead of only the antennae and rostrum. The other assemblies presented from 37.1 to 89.52 % of the CEGMA set and 25–79.44 % sequences were complete (Table 2). Except for assembly 8-b, Illumina single-end reads gave better results when cleaned with the less stringent protocol cleaned-1 than with cleaned-2 (Table 2). The correction of 454 contigs with Illumina reads may increase the partial CEGMA result up to 6 % (assembly 10-a vs. 9-a, Table 2).

Number of potentially chimeric contigs

Whatever the reads (454 or Illumina single or paired), the proportion of potentially chimeric contigs remained below 0.2 % (assemblies 1–6). By contrast, the assemblers influenced this proportion. The Newbler assembler led to a higher proportion of potentially chimeric contigs than the other 454 assemblers, CAP3 or MIRA (0.19 % compared to an average of 0.05 %). More significantly, the complex assemblies increased the proportion of potentially chimeric contigs up to 1 % (7-c; 8-b; 9-a, b, c and 10-a, b, c).

Chimeric contigs coverage

All contigs analyzed showed heterogeneous coverage (Fig. 4). Among the potentially chimeric contigs, we found three types of graphs. An example of the first type is shown with Contig 9883, where a marked drop in the average coverage is observed (around the 800th base). This region is located between the two alignments with two different R. prolixus putative genes. Comp1346_c0_seq 1 is an example of the second type of graph: two areas are well differentiated in their coverage. The first of these areas is from the 1st to 1,500th base pair and has a relatively low coverage (up to 100 reads/base) compared to the second, which extends from the 1,500 to the 2,500th base pair and has a higher coverage of up to 600 reads/base. These two regions correspond to those of the two Blast hits. The third type of graph is illustrated with Contig 4,175, which does not show a real difference in coverage between the two aligned regions nor a drop located between Blast hits. This last type of graph is the most frequent and is similar to the one obtained from the highly reliable contig comp1694_c0_seq 1.

SNP

The population diversity was evaluated estimating the SNP number of all samples pooled and it was compared to the individual diversity estimated using the SFF sample. Mapping of single-end reads is not constrained in the same way as mapping of paired-end reads which could distort the comparison. We therefore used only forward reads of the SFF sample for SNP identification. SFF forward reads and pool of all population samples present quite similar coverage (46,051,230 vs. 74,386,931 reads). We used a quality filter of 20 for SNP identification and found 53,991 SNPs for SFF and 66,279 for all samples pooled representing one SNP every 1,189 and 969 bp, respectively. As expected, the number of SNP was lower for SFF (a single diploid individual) than for the pool data (mix of 63 diploid individuals). However, the polymorphism level of the single SFF appears quite high compared to the population sample.

Characteristics of the best assembly

Combining the different metrics presented, the assembly showing the best compromise between contig distribution, completeness and percentage of chimeric contigs was 10-c. In order to estimate the number of the coding genes contained in the best assembly consisting of 57,657 contigs, we performed a Blast against R. prolixus peptide dataset. Among the 18,642 T. brasiliensis contigs with a peptide blast hit, 8,125 peptides were retrieved (43.6 %). Assuming this non-redundant transcript rate as representative of the complete dataset, the estimated number of the coding genes is about 25,000 in this assembly.

GO annotations for assembly-10c are presented in Fig. 5. Among the 57,657 contigs, 16,612 (28,8 %) gave significant blast hit, 11,378 (19,7 %) were aligned with a GO term and finally 7,117 (12,3 %) were annotated. The biological process was composed of a majority of single-organism process (15 %), metabolic process (14 %), and cellular process (19 %) transcripts. The cellular component was predominantly represented by organelle (28 %) and cell (39 %). Binding and catalytic activities represented the major molecular functions (48 and 36 % respectively). One contig was annotated as virion cellular component. Similar results were found for assembly 10a (Supplementary data S2).

Discussion

In a comparative study of non-model organism genomics, the conclusions drawn from the analysis of HTS data partly depend on the quality of the assembly (Stapley et al. 2010; Vijay et al. 2013). Major information may be missed out from an incomplete assembly. Moreover, transcripts that are improperly assembled and/or scattered across different contigs may complicate the read mapping and skew differential expression results. Thus, it is fundamentally important to construct the most optimized assembly from an available dataset. In order to conduct a comparative study of the different T. brasiliensis populations in the future, the first step was to assemble the transcriptome of this species. In this article, we tested and compared several assembly methods from different datasets. The quality of these assemblies was assessed according to numerous criteria: number of transcripts, transcript size, completeness, and chimeric contig occurrence based on the predicted proteome of R. prolixus.

Influence of assembly and workflow software

Concerning 454 data assemblies, Newbler generated longer contigs than MIRA (N50 of 662 vs. 491 bp). Cahais et al. (2012) found similar results when assembling 454 reads from Ciona intestinalis, as did Ren et al. (2012) on simulated data and 454 reads from Trychophyton rubrum. Newbler is more stringent in filtering contig size. The size distribution appeared therefore moved toward longer contig size (increased N50) but this resulted mainly from the elimination of small contigs as well as the genetic information they included (Cahais et al. 2012). This is typically in line with what we observed in assembly 1. Our results suggest that, despite being longer, Newbler contigs were less informative than those generated using MIRA (CEGMA partial 37 vs. 43.55 %), in agreement with the results of Ren et al. (2012) and Cahais et al. (2012). However, in this latter study, the highest N50 and contig number aligned to the reference were found using CAP3. In our study, CAP3 produced similar results to those of MIRA. Several software programs are also available to assemble Illumina data. According to Zhao et al. (2011), the assembler Trans-Abyss is a good compromise between RAM need, runtime and assembly quality. Cahais et al. (2012) also endorse this assembly software. Oases is another assembler frequently used (Feldmeyer et al. 2011; Francis et al. 2013; Haznedaroglu et al. 2012; Zhao et al. 2011). We tested Oases and Trans-Abyss to assemble T. brasiliensis Illumina reads with different k-mer sizes and default options. The assemblies generated (data not shown) were either similar or less informative than those obtained with Trinity. Therefore, we decided to use only this latter software.

Completeness and annotation of the transcriptome

The expected distribution of transcript number and length may be estimated from the reference genome of R. prolixus, which has been recently sequenced and assembled (www.vectorbase.org). The predicted transcriptome of this species comprises 17,155 genes with a mean transcript length of 1,016.5 bp and a maximum length of 34,028 bp. Therefore, we expected to find the same orders of magnitude for T. brasiliensis transcripts. All transcriptome assemblies (except assembly 1) had a contig number and a total number of bases higher than that of R. prolixus predicted gene set defined from the genome. Transcriptome assemblies from other Hemiptera species also exhibit a large number of contigs. For example, assembly of transcriptome data comprised 54,748 contigs for Trialeurodes vaporariorum with an average length of 965 bp (Karatolos et al. 2011) and 35,646 expressed sequence tags with an average contig length of 759 bp in the bed bug Cimex lectularius (Bai et al. 2011). In more distant arthropod groups, 51,999 contigs were assembled for Sesamia nonagrioides (Glaser et al. 2013), 128,391 for Theridion grallator and 104,481 for Theridion californucum (Croucher et al. 2013). The discrepancies between transcript and gene numbers could be attributed to alternative splicing events producing several different transcripts from a single gene and increasing the transcriptome size. Such a result was observed by Hull et al. (2013) who found 36,131 transcript isoforms corresponding to 19,742 genes in western tarnished plant bug Lygus hesperus. However, the estimated gene number of T. brasiliensis transcriptome (about 25,000) was still higher than the one of R. prolixus. The gene content between the two bug species may be somewhat different as they are quite distant: genetic duplications specific to T. brasiliensis may have occured followed by gain of function. The difference may also result from an underestimation of the gene content of the recently sequenced genome of R. prolixus. A CEGMA test we performed on the predicted transcriptome of R. prolixus gave a result of 14.11 % complete proteins and 17.34 % partial proteins. This suggests that the putative gene set of R. prolixus is still incomplete.

According to CEGMA, assemblies from 454 (assemblies 1–5, Table 2) and those from Illumina single-reads of populations (assemblies 6-a1–6-a14, Table 1) gave very poorly informative results. Overall, the CEGMA increased with the number of initial sample reads. The transcriptome built from pool of single reads from all population samples (assembly 6-a) has a high completeness rate (CEGMA: 83.47 % partial and 60.08 % complete), but one lower than the assembly constructed from a single sample (SFF), while the sequencing depth of this latter sample is lower. According to Francis et al. (2013) the transcriptome completeness using CEGMA values increases with the sequencing depth, but decreases beyond a threshold (estimated between 20 and 30 million reads). An excess of reads may complicate assembly and lead to errors due to the occurrence of misassembled or chimeric cases. However, transcriptomes of spiders Theridion grallor and Theridion californicum were assembled from about 250 million Illumina reads each and showed CEGMA results of 99 and 98 % complete, respectively (Croucher et al. 2013). It is more likely that appropriate read number to maximize the assembly completeness is variable and intrinsic to each organism depending on the complexity of the genome, such as the number of transcribed genes, their average size and the gene expression level (Zhao et al. 2011). Moreover, the completeness of the transcriptome also varies depending on the tissues targeted for the extraction. Francis et al. (2013) found a higher percentage of conserved eukaryotic orthologous genes using a transcriptome built from a complete individual RNA than from a single tissue (90 vs. 63–81 %), reflecting that genes may not always be expressed at a sufficient level in all tissues. Our population samples were derived from antennae and mouthparts of bugs, while brain tissue was also targeted for the SFF sample in order to obtain enough RNA material from a single individual. This addition of brain tissue may have improved the completeness of the transcriptome.

For this SFF sample, two other parameters could also explain its CEGMA value. First, this sample comprises a single individual, which decreases the polymorphism rate and facilitates assembly. Indeed, SNP identification reveals a higher polymorphism in the pooled population sample than in the individual sample. Second, the paired-end technology used for this assembly has a positive effect on the completeness of the transcriptome as revealed by the comparison with the single-end assembly for this sample. According to Sengupta et al. (2011), paired-end technology makes it possible to detect alternative splice junctions, insertions and deletions, and facilitates the de novo assembly of the transcriptome. However, single-end sequencing is still very widely used because of its lower cost.

Despite differences in completeness and polymorphism, the type of samples did not change the GO annotation. The proportion of biological process, cellular composition and molecular function annotations are similar between assemblies 10-a and 10-c and comparable to those found for other chemosensory transcriptomes of insects, such as Sodoptera littoralis (Poivet et al. 2013) and S. nonagrioides (Glaser et al. 2013). Equivalent repartitions were also found in transcriptomes from guts or full-body of Aphis glycines (Liu et al. 2012).

Chimeric contigs

Among the assemblies tested, the characteristics of assembly 10-c appeared to offer the best compromise, taking into account the contig numbers, the N50 and the CEGMA values. This assembly generated contigs up to 47,145 bp, a number compatible with the largest predicted transcript for R. prolixus (34,028 bp, www.vectorbase.org), but the occurrence of chimeric contigs could not be excluded. Overall, the number of potential chimeric contigs increased with the complexity of the assembly workflows (assemblies 7–10), but also in relation to the number of available annotated genes. For example, assemblies of 454 EST simulated from the human genome generated with Newbler, CAP3 or MIRA gave high percentages of chimeric contigs (14 % with MIRA, 15 % with CAP3 and 38 % with Newbler; Mundry et al. 2012). However, in the Mundry et al. (2012) study, assemblies and reference are issued from the same genome, which enhances the power of chimera detection. In our case, the reference genome is quite distant, which should reduce the percentage of chimeric contigs detected. Conversely, the assembly and annotation of the R. prolixus genome is unfinished, which could artificially increase the percentage of false chimeric contigs. Owing to both observations, the number of potential chimera detected (up to 1 %) seems satisfactory.

The occurrence of chimeric contigs had been examined by studying their coverage. For some of the chimeric contigs, the significant drop in coverage could be interpreted as the boundary between two different transcripts assembled in a single contig. For some others, two regions were differentiated by distinct coverage that may be a consequence of two differentially expressed transcripts being erroneously assembled. It could be also the result of alternative splicing with the presence of an exon over-expressed compared to others because it is included in several isoforms. Even in this case where the chimera nature of the contig is not established, it may be preferable to separate both exons to better estimate differential expression level. That is why, in these two coverage profiles, chimeric contigs were cut in order to generate two unlinked transcripts. However, in most cases, the coverage of potentially chimeric contigs was very heterogeneous with no clear boundary. Such heterogeneity was also found in contigs aligned on only one reference gene with a high reliability, which questions the chimeric nature of this contig class. Moreover, in some rare cases, the fusion of mRNA can be generated during trans-splicing events (Bonen 1993). Such fusion would be interpreted as chimeric contigs in our protocol while it corresponds to a true transcript. Consequently, the percentage of potential chimeric contigs is probably overestimated.

Conclusion

Our study documents how a high quality assembly results from various parameters including the sampling procedure, the sequencing strategy and the assembly workflow. Despite it is tempting to perform assembly on population pool with the aim to better appreciate global genetic information, we do not recommend such an approach. Instead, we showed that the use of a single individual with large representation of biological tissues is preferable to improve both the total number of bases assembled and the completeness of the assembly. In this case, the assembly procedure is facilitated by decreasing polymorphism rate and the representation of the whole transcriptome is improved by the diversity of tissues sampled. Sequencing technologies now require small amount of RNA allowing to sequence only one individual even for species or tissues of small size. In addition to the choice of biological material, we highly recommend a mixed sequencing strategy merging both long reads as generated by 454 or the recent MiSeq Illumina technology (that allows up 2 × 250pb fragments in paired-end reads) and short but paired-end Illumina reads (2 × 100 pb fragments). We showed that the well-documented biases of 454 technology (such as homopolymers) may be corrected using Illumina sequences and that utilization of both technologies greatly improves length distribution of contigs generated. The size distribution is also increased by the usage of paired-end rather than single-end technology. Finally, the usage of various assemblers is also a powerful approach combining the specific advantages of each as far as duplicate contigs are eliminated in the final step.

References

Almeida CE, Pacheco RS, Haag K et al (2008) Inferring from the Cyt B gene the Triatoma brasiliensis Neiva, 1911 (Hemiptera: Reduviidae: Triatominae) genetic structure and domiciliary infestation in the state of Paraíba, Brazil. Am J Trop Med Hyg 78:791–802
CAS PubMed Google Scholar
Bai X, Mamidala P, Rajarapu SP et al (2011) Transcriptomics of the bed bug (Cimex lectularius). PLoS ONE 6:e16336. doi:10.1371/journal.pone.0016336
Article PubMed Central CAS PubMed Google Scholar
Bonen L (1993) Trans-splicing of pre-mRNA in plants, animals, and protists. FASEB J 7:40–46
CAS PubMed Google Scholar
Borges ÉC, Dujardin J-P, Schofield CJ et al (2005) Dynamics between sylvatic, peridomestic and domestic populations of Triatoma brasiliensis (Hemiptera: Reduviidae) in Ceará State, Northeastern Brazil. Acta Trop 93:119–126. doi:10.1016/j.actatropica.2004.10.002
Article PubMed Google Scholar
Cahais V, Gayral P, Tsagkogeorga G et al (2012) Reference-free transcriptome assembly in non-model animals from next-generation sequencing data. Mol Ecol Resour 12:834–845. doi:10.1111/j.1755-0998.2012.03148.x
Article CAS PubMed Google Scholar
Chevreux B, Pfisterer T, Drescher B et al (2004) Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res 14:1147–1159. doi:10.1101/gr.1917404
Article PubMed Central CAS PubMed Google Scholar
Conesa (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21
Costa J (1999) The synanthropic process of Chagas disease vectors in Brazil, with special attention to Triatoma brasiliensis Neiva, 1911 (Hemiptera, Reduviidae, Triatominae) population, genetical, ecological, and epidemiological aspects. Mem Inst Oswaldo Cruz 94:239–241. doi:10.1590/S0074-02761999000700038
Article PubMed Google Scholar
Costa J, de Almeida JR, Britto C et al (1998) Ecotopes, natural infection and trophic resources of Triatoma brasiliensis (Hemiptera, Reduviidae, Triatominae). Mem Inst Oswaldo Cruz 93:7–13
Article CAS PubMed Google Scholar
Costa J, Almeida CE, Dotson EM et al (2003) The epidemiologic importance of Triatoma brasiliensis as a Chagas disease vector in Brazil: a revision of domiciliary captures during 1993-1999. Mem Inst Oswaldo Cruz 98:443–449. doi:10.1590/S0074-02762003000400002
Article PubMed Google Scholar
Croucher PJ, Brewer MS, Winchell CJ et al (2013) de novo characterization of the gene-rich transcriptomes of two color-polymorphic spiders, Theridion grallator and T. californicum (Araneae: Theridiidae), with special reference to pigment genes. BMC Genom 14:862. doi:10.1186/1471-2164-14-862
Article Google Scholar
Danecek P, Auton A, Abecasis G et al (2011) The variant call format and VCFtools. Bioinformatics 27:2156–2158. doi:10.1093/bioinformatics/btr330
Article PubMed Central CAS PubMed Google Scholar
Development Core Team R (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Google Scholar
Dias JCP, Silveira AC, Schofield CJ (2002) The impact of Chagas disease control in Latin America: a review. Mem Inst Oswaldo Cruz 97:603–612. doi:10.1590/S0074-02762002000500002
Article CAS PubMed Google Scholar
Ekblom R, Galindo J (2011) Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity 107:1–15. doi:10.1038/hdy.2010.152
Article PubMed Central CAS PubMed Google Scholar
Feldmeyer B, Wheat CW, Krezdorn N et al (2011) Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genom 12:317. doi:10.1186/1471-2164-12-317
Article Google Scholar
Francis WR, Christianson LM, Kiko R et al (2013) A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly. BMC Genom 14:167. doi:10.1186/1471-2164-14-167
Article CAS Google Scholar
Glaser N, Gallot A, Legeai F et al (2013) Candidate chemosensory genes in the stemborer Sesamia nonagrioides. Int J Biol Sci 9:481–495. doi:10.7150/ijbs.6109
Article PubMed Central PubMed Google Scholar
Grabherr MG, Haas BJ, Yassour M et al (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644–652. doi:10.1038/nbt.1883
Article PubMed Central CAS PubMed Google Scholar
Hashimoto K, Schofield CJ (2012) Elimination of Rhodnius prolixus in Central America. Parasit Vectors 5:45. doi:10.1186/1756-3305-5-45
Article PubMed Central PubMed Google Scholar
Haznedaroglu BZ, Reeves D, Rismani-Yazdi H, Peccia J (2012) Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms. BMC Bioinform 13:170. doi:10.1186/1471-2105-13-170
Article Google Scholar
Huang X, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9:868–877. doi:10.1101/gr.9.9.868
Article PubMed Central CAS PubMed Google Scholar
Hull JJ, Geib SM, Fabrick JA, Brent CS (2013) Sequencing and de novo assembly of the western tarnished plant bug (Lygus hesperus) transcriptome. PLoS ONE 8:e55105. doi:10.1371/journal.pone.0055105
Article PubMed Central PubMed Google Scholar
Karatolos N, Pauchet Y, Wilkinson P et al (2011) Pyrosequencing the transcriptome of the greenhouse whitefly, Trialeurodes vaporariorum reveals multiple transcripts encoding insecticide targets and detoxifying enzymes. BMC Genom 12:56. doi:10.1186/1471-2164-12-56
Article CAS Google Scholar
Knudsen B, Knudsen T, Flensborg M et al (2007) CLC Genomics Workbench. Version 5:5
Google Scholar
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760. doi:10.1093/bioinformatics/btp324
Article PubMed Central CAS PubMed Google Scholar
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659. doi:10.1093/bioinformatics/btl158
Article CAS PubMed Google Scholar
Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079. doi:10.1093/bioinformatics/btp352
Article PubMed Central PubMed Google Scholar
Liu S, Chougule NP, Vijayendran D, Bonning BC (2012) Deep sequencing of the transcriptomes of Soybean aphid and associated endosymbionts. PLoS ONE 7:e45161. doi:10.1371/journal.pone.0045161
Article PubMed Central CAS PubMed Google Scholar
Marçais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764–770. doi:10.1093/bioinformatics/btr011
Article PubMed Central PubMed Google Scholar
Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17:10–12
Article Google Scholar
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12:671–682. doi:10.1038/nrg3068
Article CAS PubMed Google Scholar
Martin J, Bruno VM, Fang Z et al (2010) Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genom 11:663. doi:10.1186/1471-2164-11-663
Article CAS Google Scholar
Mundry M, Bornberg-Bauer E, Sammeth M, Feulner PGD (2012) Evaluating characteristics of de novo assembly software on 454 transcriptome data: a simulation approach. PLoS ONE 7:e31410. doi:10.1371/journal.pone.0031410
Article PubMed Central CAS PubMed Google Scholar
Parra G, Bradnam K, Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23:1061–1067. doi:10.1093/bioinformatics/btm071
Article CAS PubMed Google Scholar
Paszkiewicz K, Studholme DJ (2010) De novo assembly of short sequence reads. Brief Bioinform 11:457–472. doi:10.1093/bib/bbq020
Article CAS PubMed Google Scholar
Poivet E, Gallot A, Montagné N et al (2013) A comparison of the olfactory gene repertoires of adults and larvae in the noctuid moth Spodoptera littoralis. PLoS ONE 8:e60263. doi:10.1371/journal.pone.0060263
Article PubMed Central CAS PubMed Google Scholar
Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842. doi:10.1093/bioinformatics/btq033
Article PubMed Central CAS PubMed Google Scholar
Ren X, Liu T, Dong J et al (2012) Evaluating de Bruijn graph assemblers on 454 transcriptomic data. PLoS ONE 7:e51188. doi:10.1371/journal.pone.0051188
Article PubMed Central CAS PubMed Google Scholar
Robertson G, Schein J, Chiu R et al (2010) De novo assembly and analysis of RNA-seq data. Nat Methods 7:909–912. doi:10.1038/nmeth.1517
Article CAS PubMed Google Scholar
Sammeth M (2009) Complete alternative splicing events are bubbles in splicing graphs. J Comput Biol 16:1117–1140. doi:10.1089/cmb.2009.0108
Article CAS PubMed Google Scholar
Santos A, Ribeiro JMC, Lehane MJ et al (2007) The sialotranscriptome of the blood-sucking bug Triatoma brasiliensis (Hemiptera, Triatominae). Insect Biochem Mol Biol 37:702–712. doi:10.1016/j.ibmb.2007.04.004
Article PubMed Central CAS PubMed Google Scholar
Schmieder R, Edwards R (2011a) Quality control and preprocessing of metagenomic datasets. Bioinformatics 27:863–864. doi:10.1093/bioinformatics/btr026
Article PubMed Central CAS PubMed Google Scholar
Schmieder R, Edwards R (2011b) Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS ONE 6:e17288. doi:10.1371/journal.pone.0017288
Article PubMed Central CAS PubMed Google Scholar
Schulz MH, Zerbino DR, Vingron M, Birney E (2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28:1086–1092. doi:10.1093/bioinformatics/bts094
Article PubMed Central CAS PubMed Google Scholar
Sengupta S, Bolin JM, Ruotti V et al (2011) Single read and paired end mRNA-seq Illumina libraries from 10 nanograms total RNA. J Vis Exp. doi:10.3791/3340
Google Scholar
Stapley J, Reger J, Feulner PGD et al (2010) Adaptation genomics: the next generation. Trends Ecol Evol 25:705–712. doi:10.1016/j.tree.2010.09.002
Article PubMed Google Scholar
Surget-Groba Y, Montoya-Burgos JI (2010) Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res 20:1432–1440. doi:10.1101/gr.103846.109
Article PubMed Central CAS PubMed Google Scholar
Vijay N, Poelstra JW, Künstner A, Wolf JBW (2013) Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments. Mol Ecol 22:620–634. doi:10.1111/mec.12014
Article CAS PubMed Google Scholar
Werner T (2010) Next generation sequencing in functional genomics. Brief Bioinform 11:499–511. doi:10.1093/bib/bbq018
Article CAS PubMed Google Scholar
Xie Y, Wu G, Tang J, et al. (2013) SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. ArXiv13056760 Q-Bio
Zhao Q-Y, Wang Y, Kong Y-M et al (2011) Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinform 12:S2. doi:10.1186/1471-2105-12-S14-S2
Article CAS Google Scholar

Download references

Acknowledgments

We would like to thank Rachel Legendre and Claire Toffano of Institut de Génétique et Microbiologie CNRS - UMR 8621 who gave us the script for 454 contig correction. We thank Marie-Christine François (iEES, INRA Versailles, France) for help with the T. brasiliensis RNA extractions. The authors are also very grateful to the engineers of the bioinformatics platforms Genouest at the University of Rennes 1 and eBio of the University Paris Sud for technical support. This work has benefited from the facilities and expertise of the HTS platform of IMAGIF (Centre de Recherche de Gif - www.imagif.cnrs.fr. This study was funded by the French Agence Nationale de la Recherche (ADAPTANTHROP project, ANR-097-PEXT-009) and supported by the labex Biodiversité, Agroécosystèmes, Société, Climat (BASC; University Paris Saclay, France). Marchant A. was funded by the Idex Paris Saclay, France.

Conflict of interest

The authors announce that they have not a financial relationship with the organization that sponsored the research. The authors declare that they have no conflict of interest.

Author information

Authors and Affiliations

Laboratoire Evolution, Génomes et Spéciation LEGS, UPR 9034, CNRS, Avenue de la Terrasse, Bâtiment 13, BP1, 91198, Gif-sur-Yvette, France
A. Marchant, F. Mougel & M. Harry
Université Paris Sud, Orsay, France
A. Marchant, F. Mougel & M. Harry
Departamento de Ciências Biológicas, Faculdade de Ciências Farmacêuticas, UNESP, Araraquara, SP, Brazil
C. Almeida
INRA, UMR 1392, Institut d’Ecologie et des Sciences de l’Environnement de Paris, Route de Saint Cyr, 78026, Versailles, France
E. Jacquin-Joly
Laboratório de Biodiversidade Entomológica, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
J. Costa

Authors

A. Marchant
View author publications
You can also search for this author in PubMed Google Scholar
F. Mougel
View author publications
You can also search for this author in PubMed Google Scholar
C. Almeida
View author publications
You can also search for this author in PubMed Google Scholar
E. Jacquin-Joly
View author publications
You can also search for this author in PubMed Google Scholar
J. Costa
View author publications
You can also search for this author in PubMed Google Scholar
M. Harry
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to A. Marchant or M. Harry.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (TAR 329 kb)

Supplementary material 2 (TAR 282 kb)

Supplementary material 3 (TAR 266 kb)

Supplementary material 4 (TIFF 13369 kb)

Supplementary material 5 (DOCX 12 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Marchant, A., Mougel, F., Almeida, C. et al. De novo transcriptome assembly for a non-model species, the blood-sucking bug Triatoma brasiliensis, a vector of Chagas disease. Genetica 143, 225–239 (2015). https://doi.org/10.1007/s10709-014-9790-5

Download citation

Received: 30 April 2014
Accepted: 01 September 2014
Published: 19 September 2014
Issue Date: April 2015
DOI: https://doi.org/10.1007/s10709-014-9790-5

De novo transcriptome assembly for a non-model species, the blood-sucking bug Triatoma brasiliensis, a vector of Chagas disease

Abstract

Similar content being viewed by others

Introduction

Materials and methods

Sampling

RNA extraction and sequencing

Preprocessing

Assembly tools

Workflow

Illumina data compilation for assembly

Correction of 454

Indices of assembly quality

General annotation

Chimeric contig validation

SNP

Results

Contig number, total length and RMBT

N50 and length distribution

Annotation and completeness

Number of potentially chimeric contigs

Chimeric contigs coverage

SNP

Characteristics of the best assembly

Discussion

Influence of assembly and workflow software

Completeness and annotation of the transcriptome

Chimeric contigs

Conclusion

References

Acknowledgments

Conflict of interest

Author information

Authors and Affiliations

Corresponding authors

Electronic supplementary material

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation