The Draft Genome of the MD-2 Pineapple

Redwan, Raimi M.; Saidin, Akzam; Kumar, Subbiah V.

doi:10.1007/978-3-030-00614-3_9

Part of the book series: Plant Genetics and Genomics: Crops and Models (PGG, volume 22)

Abstract

With advances in sequencing technology, it is now possible to decode complex plant genomes with high accuracy. For many years, short reads were the dominant data used for assembling genomes, until the introduction of third-generation long-read sequencing machines. Long reads can extend through complex repeat regions, avoiding the erroneous collapse of repeats that reduces the genome assembly size. However, the low accuracy of long reads is a cause for concern and hinders their direct application in de novo assemblies of large genomes. Here, we report the whole-genome assembly of the MD-2 pineapple using a hybrid sequencing approach. We used Illumina short reads to correct the systematic errors of the long PacBio reads. The error-corrected long reads were then used to assemble the MD-2 pineapple genome de novo with multiple assembly software packages and strategies. The best accuracy and contiguity were achieved by de novo assembly of the error-corrected long reads using Celera. The MD-2 pineapple genome assembly achieved an N50 of 153,084 bp with 8448 scaffolds and a total assembly size of 524.07 Mb. In addition, 245 of the 248 ultra-conserved CEGs were found in the genome, indicating completeness of more than 98%. Furthermore, 87% of the mapped transcripts were identified in the genome with coverage of more than 90%, while another 12% were mapped with coverage of more than 80%. This MD-2 pineapple genome provides a high-quality draft for gene prediction and further downstream applications in pineapple.

Introduction

The main challenges in assembling a plant genome are its ploidy level, repeat content, and polymorphism. Second-generation sequencing delivered the throughput and accuracy that are crucial to whole-genome sequencing, but these remained insufficient for some plant species. Genomes assembled from next-generation sequencing data are known to contain small contigs that inflate the number of annotated genes (Varshney et al. 2011) and to miss the transposable elements that are abundant in plant genomes, owing to their repetitive nature (Michael and Jackson 2013).

Many plant genome projects have reported an unresolved part of the genome, namely the heterochromatin, which was left unassembled in the final draft (Cheung et al. 2006; Tuskan et al. 2006; Ming et al. 2008; Wang et al. 2012a, b, 2014). Heterochromatin is tightly packed in the centric and subtelomeric regions of the chromosome and is highly repetitive, which makes it difficult to sequence and assemble (Hoskins et al. 2002). This complexity does not make these regions any less important to decode, because they also contain genes and important regulatory elements for euchromatic genes (He et al. 2012). In whole-genome sequencing projects, especially those using shotgun strategies, the heterochromatic regions were resolved only as a subsequent improvement of the draft genome, using precise physical mapping for targeted resequencing of transposons (Devine et al. 1997; Hoskins et al. 2002). This sort of information may not be available for non-model plants, and the intrinsic solution for improving the resolution of repetitive reads from heterochromatic regions is longer reads that can span the repeat elements.

Long reads from third-generation sequencing are not directly usable for complete de novo whole-genome sequencing either. The 99.99% accuracy of PacBio reads can be achieved only in consensus reads, where random errors are resolved by consensus calling. In a single pass, PacBio reads have a high error rate, so using them on their own requires error correction. Errors in the reads prevent the assembler from establishing an overlap-layout path between reads in order to merge them. Error correction can be performed either with the PacBio reads themselves or with the high-accuracy reads from second-generation sequencing. Self-correction of PacBio reads requires redundant coverage of at least 50-fold of the target genome to generate an accurate consensus (Chin et al. 2013). For pineapple, which has an estimated genome size of 526 Mb, this translates to 26.3 Gb of data, equivalent to about 58 SMRT cells at an output of 450 Mb per cell.
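The data requirement quoted above follows from simple arithmetic. The sketch below reproduces it in Python. The function and variable names are illustrative and do not come from the original work, and the calculation assumes that every SMRT cell yields exactly 450 Mb.

```python
import math


def required_yield_mb(genome_size_mb: float, fold_coverage: float) -> float:
    """Total sequence (in Mb) needed to cover the genome at the given depth."""
    return genome_size_mb * fold_coverage


def smrt_cells(yield_mb: float, output_per_cell_mb: float) -> float:
    """Number of SMRT cells (possibly fractional) that deliver that yield."""
    return yield_mb / output_per_cell_mb


# Values taken from the text; the constant names are hypothetical.
GENOME_SIZE_MB = 526      # estimated pineapple genome size
FOLD_COVERAGE = 50        # minimum depth for PacBio self-correction
OUTPUT_PER_CELL_MB = 450  # assumed output of one SMRT cell

total_mb = required_yield_mb(GENOME_SIZE_MB, FOLD_COVERAGE)
cells = smrt_cells(total_mb, OUTPUT_PER_CELL_MB)

print(f"Required data: {total_mb / 1000:.1f} Gb")
print(f"SMRT cells: {cells:.1f} (whole cells for full depth: {math.ceil(cells)})")
```

The exact ratio is about 58.4 cells, which the text rounds to 58; reaching the full 50-fold depth with whole cells would take 59 under these assumptions.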

In addition, the cost per base pair of PacBio sequencing was high compared with second-generation sequencing. Self-correction of the long reads is preferable because it avoids carrying over the inherent error profile of another sequencing platform and reduces the length trimming that occurs where the other read pool, which may suffer from sequencing bias, lacks coverage. This strategy may be the best option for future de novo genome sequencing, but at its current price, generating 50-fold coverage for a large eukaryotic genome can be difficult for many researchers, especially in developing countries.

Nevertheless, the potential of PacBio long reads to assemble a shotgun-sequenced genome into a single finished contig is indisputable and has been demonstrated (Koren et al. 2012a; Chin et al. 2013; Huddleston et al. 2014). However, these successes were limited to bacterial genomes of 2–6 Mb, a size that allows deep sequencing with just a few SMRT cells on the PacBio platform. A complex plant genome would require many SMRT cells to achieve sufficient coverage. The alternative is hybrid sequencing, which borrows the high accuracy of second-generation sequencing to improve the PacBio long reads. In addition, many genome sequencing projects began with second-generation sequencing. These data should not be wasted and should be used for what they do best, which is accuracy. Recently, the hybrid method has proven successful in the complete assembly of several genomes (Koren et al. 2012b; Ribeiro et al. 2012; Pendleton et al. 2015) and, to a lesser extent, in improving the contiguity of complex genomes such as that of an orchid (Yan et al. 2015).

In sequencing the pineapple genome, the main challenges are its heterozygosity and its recalcitrance to self-pollination. The innate parthenocarpic nature of the plant prevents the development of inbred lines that would facilitate a sequencing project. The large number of multi-allelic sites in the genome complicates assembly, especially at the contig-building stage, because the mismatches between alleles cause "bubble" structures to form. In the assembly of the genome of the pineapple hybrid F153, the problem of heterozygosity was reduced by using haplotype phasing methods to eliminate one of the haploid copies and so lower the complexity of the assembly (Ming et al. 2015).

In the assembly of the commercially important MD-2 pineapple, long-read sequencing technology was used to tackle the repetitive and complex multi-allelic regions of the genome. However, because of the high random error rate inherent to PacBio long reads at low coverage, the reads need their accuracy improved before they can be used directly in whole-genome assembly. The approach used in this project was to combine two leading-edge sequencing platforms (i.e., Illumina and PacBio) in a hybrid assembly to construct a draft of the MD-2 pineapple genome.

Three different strategies were tested to find the best pipeline, that is, one producing an assembly that is complete as defined by the assembly size, accurate as defined by the predicted gene content, and contiguous as defined by the scaffold size and N50. In the first strategy, a de novo assembly of short-insert reads was improved by using PBJelly to perform gap filling and scaffolding with the uncorrected PacBio sub-reads. In the second, the contigs from the short-read assembly were used as anchors for assembling the uncorrected PacBio long reads with the newly developed DBG2OLC software (Ye et al. 2016). In the third, the PacBio long reads were first error-corrected with the Illumina short reads using the novoLR package (Hercus 2015), and the corrected reads were then assembled de novo using a traditional overlap-layout-based assembler, Celera (Myers et al. 2000). The assemblies from the three strategies were then compared on the basic assembly metrics, the number of pineapple transcripts mapped to the genome, and the number of core eukaryotic genes found in the genome as assessed with CEGMA.
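Two of the selection metrics named above can be stated precisely. The following is a minimal sketch, not the code used by Assemblathon or CEGMA. It assumes the usual definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length) and treats completeness as the fraction of core genes recovered. The function names and the toy scaffold lengths are made up for illustration.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that pushes the running sum to >= 50%.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


def completeness_percent(genes_found: int, genes_expected: int) -> float:
    """Percentage of expected core genes found in the assembly."""
    return 100.0 * genes_found / genes_expected


# Toy scaffold lengths (hypothetical, not from the MD-2 assembly).
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print("Toy N50:", n50(toy_scaffolds))

# CEG counts reported in the abstract: 245 of 248 found.
print(f"CEG completeness: {completeness_percent(245, 248):.2f}%")
```

For the toy lengths the total is 290, and the two longest scaffolds (80 + 70 = 150) already cover more than half of it, so the N50 is 70. The reported 245 of 248 CEGs corresponds to about 98.79%, consistent with the "more than 98%" stated in the abstract.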

Sample Materials

The MD-2 pineapple was obtained from the Malaysia Pineapple Industry Board and was maintained at the Biotechnology Research Institute, UMS, for pineapple laboratory work. In this study, all genomic DNA was extracted from the leaves of a single plant.

De Bruijn-Based Assembly Using Only Short Reads

To find the best assembler for the highly heterozygous pineapple genome, three different assemblers were chosen based on their established reliability and their suitability for complex genomes. The quality assessment of the three assemblies, performed with Assemblathon (Earl et al. 2011), is summarized in Table 9.1.

Table 9.1 Summary of assembly metrics across three different pineapple draft genomes produced using the respective assembly software
