Background & Summary

Phlebotomine sand flies (family Psychodidae, order Diptera) include several genera of hematophagous arthropods that vector important emerging and re-emerging infectious diseases. They transmit bacterial, viral, and, most notably, the protozoan pathogen Leishmania, to humans and animals. Leishmaniasis is a group of diseases that range in clinical manifestation, from self-healing cutaneous lesions to disfiguring mucocutaneous ulcers to fatal visceral disease. Clinical tropisms can be highly dependent on infective species and vectoring sand fly. Over 90 species of sand flies found across Latin America, Africa, the eastern Mediterranean, Southeast Asia, and Europe have been implicated as vectors for approximately 20 species of Leishmania parasites that cause leishmaniasis1,2.

Phlebotomus papatasi vectors Leishmania major, an etiological agent of cutaneous leishmaniasis, across North Africa, the Middle East, and the Indian subcontinent3. It is a restrictive vector in that it can only transmit a single Leishmania species, Le. major. However, P. papatasi also transmits viral febrile illnesses across its distribution4,5. Lutzomyia longipalpis is the major vector responsible for transmission of the visceral leishmaniasis causing parasite, Leishmania infantum, in the Americas6. Lu. longipalpis is a permissive vector in the laboratory, transmitting several Leishmania species, however in nature it only transmits Le. infantum7. Lu. longipalpis has a wide geographic distribution inhabiting a range of diverse ecological habitats and has garnered interest as a species complex. Others have observed differences in spot numbers, pheromones, mating songs, and noted reproductive isolation between different populations collected throughout Brazil8. Leishmaniasis pathogenesis is thought to be dependent on complex host, vector, and parasite interactions and, although the epidemiological implications of a Lu. longipalpis species complex remain unclear, understanding the molecular underpinnings that that lead to vector competence, reproductive isolation and adaptation is critical from an epidemiological and disease control perspective.

In mosquito research, high-quality reference genomes have enabled inquiries into population genetics and metagenomics, identification of gene markers of senescence, vector competence, insecticide resistance, and experimental gene drive approaches to vector control. These have ultimately improved understanding and management of the vector in the disease transmission cycle9. Unfortunately, the fragmented nature of current sand fly references slowed similar inquiries for Leishmania transmission.

Previous reference genomes for P. papatasi and Lu. longipalpis10 suffered very low contiguity. Using the best sequencing technology at the time, read lengths were limited to ~400 bp - too short to span many repeats. More damaging to assembly contiguity, previous library protocol DNA input minimums required DNA to be pooled from many individuals, inserting many different haplotypes into the assembly algorithm. Genome heterozygosity could not be controlled for by inbreeding in sand flies, and haplotype sequence variation – for example, a short insertion polymorphism – caused assembly tools designed for a single haplotype to create sequence gaps in areas of uncertainty. Together, these constraints led the genome assemblies for P. papatasi and Lu. longipalpis to be the 2nd and 3rd worst available in VectorBase11, with contig N50 lengths at 5,795 bp and 7,481 bp, respectively. For reference, across all genomes in VectorBase at the time, the median assembly contig N50 was 51,691 bp. Additionally, no Hi-C or chromosome scale data was available, and these fragmented genome assemblies were inadequate for many genome analyses.

Here, we update these two important sand fly vector genome references leveraging a decade’s worth of technological advances. Specifically, very high quality long read sequences of Q20 or even Q30 are available in lengths longer than the previous assemblies contigs. Second, Hi-C technologies have become de rigueur and have higher chromosomal completion rates when paired with the significantly longer contigs generated by high quality long read assembly. Finally, an ultra-low input library protocol developed by Pacific Biosciences12 enabled the sequencing of a single individual sand fly. This greatly simplified assembly of sequence information from only 2 haplotypes derived from a single individual rather than many haplotypes from a pool of individuals. A small compromise, as only 30 ng of genomic DNA can be isolated from a single sand fly male, is the use of whole genome amplification. Together these three techniques have generated the greatly improved reference assemblies we describe here.

Genome Sequence Report

The genomes of P. papatasi and Lu. Longipalpis were each sequenced from a single male from colonies maintained at the University of Notre Dame. The P. papatasi colony was established in the 1970s from the Israeli strain and the Lu. Longipalpis colony was established in 1988 from the Jacobina strain caught from Bahia State, Brazil. P. papatasi sequencing generated 102x coverage and Lu. longipalpis sequencing generated 53x coverage of PacBio HiFi long reads. Additional material from other individuals from the same colonies was used for Hi-C library preparation.

The final P. papatasi assembly has a span of 351.6 Mb, 646 scaffolds, and a scaffold N50 of 111.8 Mb. The final Lu. Longipalpis assembly has a span of 147.8 Mb, 4 scaffolds, and a scaffold N50 of 40.6 Mb (Table 1, Figs. 1 & 2). The updated assemblies improved upon several deficiencies from the previous assemblies (Table 2). Compared to the previous assemblies, contiguity has improved over 100-fold and these larger contigs are placed in a chromosomal context.

Table 1 Genome data and global statistics.
Fig. 1
figure 1

Snail plot summaries of assembly statistics. (a) Lutzomyia longipalpis assembly ASM2433408v1. (b) Phlebotomus papatasi assembly JANPWV01. Both plots were generated using blobtoolkit43.

Fig. 2
figure 2

Blobplots of base coverage against GC proportion. (a) Lutzomyia longipalpis assembly ASM2433408v1. (b) Phlebotomus papatasi assembly JANPWV01 with no-hits filtered out. Both plots were generated using blobtoolkit43.

Table 2 Comparison of old and new assembly statistics.

Two genome annotations are available for each species. The first is a new NCBI RefSeq13 annotation based on not just this assembly but also new long read transcript data generated to support new annotation. Gene numbers derived from this annotation are shown in Table 2 and BUSCO analysis in Table 3. The number of complete single copy insecta single copy orthologs increased by ~10%. That is, an additional 10% of genes that were previously incomplete or missing are now easily accessible in the improved assembly. In addition to this updated annotation resource, we wished to preserve previous annotations, especially user contributed curated annotations, which connect the genome to previously published analyses. To preserve previous annotation information, we utilized the new open-source pipeline Transfer-annotations14 developed by VectorBase engineers to iteratively run Liftoff15 to accurately transfer previous annotations to new VectorBase Apollo browser tracks and generate a downloadable GFF3 annotation file for each species.

Table 3 BUSCO results for two new sandfly references.

Methods

Sample acquisition and nucleic acid extraction

Single males were chosen for sequencing to capture the heterogametic sex chromosomes, and to ensure only high quality long read sequence data from a single diploid genome was presented to the assembly software for facile assembly. A single male adult sand fly was aspirated from each of our P. papatasi and Lu. Longipalpis colonies and frozen at −80 °C. Each specimen was chilled in liquid nitrogen and ground into a fine powder preceding DNA extraction using a modified Puregene® kit extraction protocol (Qiagen, Hilden, Germany). DNA was eluted in 30 μl of TE buffer and concentration was assessed using a Nanodrop Spectrophotometer.

Long read library construction and sequencing

Pacific Biosciences HiFi Libraries were constructed using an ultra-low input library protocol12. The P. papatasi library was prepared at Pacific Biosciences using a pre-production version of the library kit. The Lu. Longipalpis library was prepared at the UC Davis DNA technologies core using the commercially available SMRTbell gDNA Sample Amplification Kit (Pacific Biosciences, Menlo Park, CA; Cat. #101-980-000) and the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences; Cat. #100-938-900) according to the manufacturer’s instructions. Briefly, approximately 10 kb sheared DNA by the Megaruptor 3 system (Diagenode, Belgium; Cat. #B06010003) was used for removal of single-strand overhangs at 37 °C for 15 minutes, DNA damage repair at 37 °C for 30 minutes, end repair and A-tailing at 20 °C for 30 minutes and 65 °C for 30 minutes, and ligation of overhang adapters at 20 °C for 60 minutes. To prepare for library amplification by PCR, the library was purified with ProNex beads (Promega, Madison, WI; Cat. # NG2002) for two PCR amplification conditions at 15 cycles each then another ProNex beads purification. Purified amplified DNA from both reactions were pooled in equal mass quantities for another round of enzymatic steps that included DNA repair, end repair/A-tailing, overhang adapter ligation, and purification with ProNex Beads. The PippinHT system (Sage Science, Beverly, MA; Cat # HPE7510) was used for SMRTbell library size selection to remove fragments <6–10 kb. The 10-11 kb average HiFi SMRTbell library was sequenced using one 8 M SMRT cell, Sequel IIe sequencing chemistry 2.0, and 30-hour movies each on a PacBio Sequel II sequencer.

Long read assembly

The draft Lu. longipalpis genome assembly was assembled using hifiasm16 from HiFi data generated from a single male individual at the UC Davis Genome Core using the ultra-low input protocol. Filtering input reads to have an average quality >Q30 was found to give a more contiguous final assembly for this dataset than Q20 filtered reads and was used for the final assembly. The draft genome assembly for P. papatasi was generated at Pacific Biosciences based on HiFi reads generated at Pacific Biosciences with a library made from a single adult male individual using an ultra-low input library kit. The long-read assembly was performed using HGAP and Falcon17.

3D sequencing and assembly

The high-quality drafts were upgraded to chromosome-length using Hi-C data derived from different male individuals from the same respective colonies at the University of Notre Dame. The in situ Hi-C libraries were generated as described in Rao, Huntley et al.18. Briefly, whole insect bodies were crosslinked with 1% formaldehyde for 10 minutes at room temperature. Nuclei were extracted via grinding and permeabilized using SDS. DNA was digested with a cocktail of Csp6I and MseI, and the ends of restriction fragments were labeled using biotinylated nucleotides then ligated. After reversal of crosslinks, ligated DNA was purified and sheared to a length of ~400 bp, at which point ligation junctions were pulled down with streptavidin beads and prepped for Illumina sequencing. The resulting libraries were sequenced using Illumina NovaSeq 6000 instruments. Hi-C data were aligned to the draft references using Juicer19, and 3D assembly for both species was performed using 3D-DNA pipeline20. In view of the large number of alternative haplotypes incorporated in the draft assembly as separate sequences21, 3D-DNA pipeline was run with the “merge” step option for Lu. longipalpis (see Matthews et al.22) to remove alt haplotypes from the anchored portion of the assembly. The resulting assemblies were reviewed and curated using Juicebox Assembly Tools23. The resulting contact maps (Fig. 3) can be explored interactively at multiple resolutions via Juicebox.js24 at the DNA Zoo website pages25,26.

Fig. 3
figure 3

Hi-C contact maps. (a) Lutzomyia longipalpis (b) Phlebotomus papatasi. Chromosome-length Hi-C contact maps visualized in Juicebox44.

Removal of non-chromosomal sequences from Lu. longipalpis

During BUSCO analysis the Lu. longipalpis draft assembly contained high numbers of duplicate BUSCO genes. This was due to the presence of alternative haplotype sequences in the unanchored portion of the assemblies. As expected, removing unanchored sequences during annotation greatly reduced the duplicates.

Gene annotation lift-over

We used the pipeline Transfer-Annotations14 and the program Liftoff15 to move previous gene annotations and manual curations to the new reference assembly. Liftoff distance and flank parameters were determined by incrementally changing them to find the combination with the lowest flank number and the fewest missing features. We used agat_sp_fix_cds_phases27 to calculate phase information and identify any transferred gene models that are incomplete or altered. AGAT’s agat_sp_extract_sequences27 was used to extract CDS protein sequences for the transferred genes on the new genome. The Transfer-annotations pipeline then identifies missing CDS regions, and it produces a corrected GFF3 with metadata regarding model validity in the GFF3 attributes column. This process includes if a protein sequence contains stop codons, if it matches the original sequence, or if it has any missing CDS regions. Transfers were considered invalid if the coding sequence had a missing CDS region or internal stop codon, or ncRNA sequences did not match between the source and transfer sequences. Coding sequences with mismatched protein sequences were not considered invalid and are flagged for future examination.

A final GFF3 of the transferred annotation is available at VectorBase as an Apollo genome browser track color coded by estimated transfer quality. A majority of genes transferred from each original source genome to its replacement assembly (Table 4). However, 30.3% and 22.0% were invalidated by missing CDS regions and internal stop codons, and 73.2% and 62.8% of CDS had mismatched protein sequences. That not all annotations could be transferred is likely unavoidable due to the differences in genome quality.

Table 4 Transfer summaries for Lu longipalpis and P. papatasi.

Data Records

Lutzomyia longipalpis PacBio HiFi28 long reads and final assembly29 are available at the NCBI with BioProject accession number PRJNA84927430. Lutzomyia longipalpis HiC short reads are available at the NCBI SRA31 with BioProject accession number PRJNA51290732. Phlebotomus papatasi PacBio HiFi long reads33 and final assembly34 are available at the NCBI with BioProject accession numbers PRJNA65724535 and PRJNA85845236 respectively. Phlebotomus papatasi HiC short reads are available at the NCBI SRA37 with BioProject accession number PRJNA51290732. Additional sub-accessions are shown in Table 1.

Technical Validation

One of our aims was for these new genome references to meet the Earth BioGenome Project standards38 despite the small amounts of input materials. Specifically, we aimed to have >1 Mb contig N50, and achieved full chromosome lengths using Hi-C data.

We assessed reference gene model completeness using BUSCO39 (V3.0.2). For both sandfly references the diptera_odb10 set of 2,910 single copy orthologs are missing 225 (6.8%) of the genes (Table 3). This number decreases when the analysis is performed on larger taxonomic groups with smaller BUSCO gene sets. For example, only ~3% of genes (P. papatasi (42) and Lu. longipalpis (29)) are missing from the 1,367 insecta_odb10 BUSCO gene set. Whilst this is a vast improvement on the previous assemblies, future work is required to determine which missing genes are due to assembly problems such as gaps between 1 Mb N50 contigs or genuine gene loss during >150 million years of divergence time between these species and others in the orthoDB database at the current time40,41.

While assessing base coverage and GC content for P. papatasi, we noticed a blob that stood out from the rest of the Arthropoda hits, with several-fold less base coverage (accession #: CM045756.1). Hits for this “blob” included families Culicidae, Curculionidae, formicidae, Kalotermitidae, Noctuidae, and Drosophilidae. To assess for contamination, we blasted these regions against the NCBI nucleotide database. The top hits returned P. papatasi. To investigate the possibility of a sex chromosome, we blasted Y chromosome-linked scaffolds in Lu. longipalpis identified by Vigoder et al. against the NCBI nucleotide database42. While there were several P. papatasi hits, none were localized to this blob. Interestingly, other hits included the X chromosome for several different species of flies, three of which have an XY sex chromosome system. Finally, we blasted our blob of interest against the Drosophila Y chromosome (NC_024512.1). There was no significant similarity found.