Introduction

Brassica oleracea is one of the most economically important Brassica species cultivated worldwide, mainly as a vegetable crop that includes cabbage, broccoli, and cauliflower subspecies/morphotypes (Kopsell and Kopsell 2006; Liu et al. 2014). B. oleracea also provided the ancestor genome of Brassica napus and Brassica carinata, both of which are cultivated as important oil crops. A high-quality genome assembly has been pursued for the improvement of the genetics and breeding in these crops. The whole-genome sequences of JZS (B. oleracea ssp. capitata, heading type) and TO1000 (B. oleracea ssp. alboglabra, kale-like type), which were assembled using next-generation sequencing (NGS) data, were released in 2014 (Liu et al. 2014; Parkin et al. 2014). While NGS technology provided high-accuracy and high-throughput reads to assemble the genome at relatively low cost, the methods had several limitations, especially the short read length that directly led to low assembly integrity, low sequence continuity, and a large number of gaps and assembly errors in the current JZS genome. Single-molecule real-time (SMRT) sequencing developed by Pacific Biosciences (PacBio) could offer longer sequencing reads, and these long reads could greatly improve the integrity of the genome assembly, especially the assembly of repeat regions (Jiao and Schneeberger 2017; Rhoads and Au 2015). In 2018, a high-quality HDEM (B. oleracea ssp. botrytis italica, broccoli type) genome assembled from long reads was released; both sequence continuity and assembly integrity had been greatly improved, in particular for the sequences of transposable elements (Belser et al. 2018). The three B. oleracea accessions JZS (Jinzaosheng), HDEM and TO1000 used here belong to three different crops. JZS is a heading cabbage belonging to B. oleracea ssp. capitata, HDEM is a broccoli belonging to B. oleracea ssp. botrytis italica, and TO1000 was derived from Chinese kale belonging to B. oleracea ssp. alboglabra.

Chromosome conformation capture (Hi-C) is a sequencing-based approach for determining genome 3D organization that has been used to anchor scaffolds to chromosomes in the Brassica genome (Dudchenko et al. 2017; Wang et al. 2019; Zhang et al. 2018). Efficient pipelines for anchoring scaffolds to chromosomes were an important factor in the rapid application of Hi-C technologies to plant genome assembly. Nowadays, the main software programs for anchoring scaffolds to chromosomes by Hi-C data are Lachesis, 3D-DNA and ALLHiC (Burton et al. 2013; Dudchenko et al. 2017; Zhang et al. 2019). Lachesis extracts the interaction information of the Hi-C read alignments and then clusters the scaffolds and divides them into different groups, finally sorting and updating the scaffolds to the chromosome level. 3D-DNA not only has the functions possessed by the Lachesis software but also automatically performs scaffold error correction based on the interaction information and handles some misjoins introduced in the preliminary assembly. Such misjoins are formed when two distant scaffolds are assembled together (Fig. S1), and they can directly affect the construction of pseudomolecules. ALLHiC is mainly used to phase and scaffold polyploid genomes based on Hi-C data. Currently, although some software can automatically correct misjoins in scaffolds (e.g., SALSA and 3D-DNA) (Dudchenko et al. 2017; Ghurye et al. 2017), these pipelines often place breakpoints within gene bodies and generate many false-positive breakpoints.

More complete assembly allows the identification of new features of transposable elements (TEs). As a major group of plant TEs, long terminal repeat retrotransposons (LTR-RTs) are an important component of plant genomes, and the content varies greatly among different species (Morgante et al. 2005). After updating the B. rapa genome with PacBio technology, a more complete TE sequence was obtained that enabled detection of a new round of LTR-RT burst events in the B. rapa genome (Zhang et al. 2018). However, a small content and only a single round of LTR-RT burst was detected in the B. oleracea JZS reference genome (Liu et al. 2014). From previous reports, amplification of LTR-RTs could change the genomic size and structure as well as regulate the diversification of different species (Kim et al. 2017; Naito et al. 2006; Zhou et al. 2017). Lineage-specific LTR-RT bursts could rapidly change genome size, and the distinct amplification patterns of different transposable element families led to diversification of different species (Ammiraju et al. 2007; Du et al. 2010; Hawkins et al. 2006; Piegu et al. 2006; Vitte et al. 2007). However, we know little about how LTR-RTs regulate genome diversification among Brassica subspecies. In B. rapa, an LTR-RT insertion in the FLOWERING LOCUS T was associated with delayed flowering which directly regulated plant phenotype (Zhang et al. 2015). With PacBio technology, more complete LTR-RT sequences could be obtained, making it possible to comprehensively study LTR-RTs.

Here, we report a new version of the B. oleracea genome assembly (JZS v2) with a high level of assembly integrity and sequence continuity. The methods involved single-molecule sequencing and Hi-C technology. We assembled an additional 83.16 Mb of genomic sequences, and the updated genome features a contig N50 size of 2.37 Mb, representing an ~ 88-fold improvement. We corrected large mis-assemblies in the previous assembly and detected a new round of LTR-RT burst in the new assembly. Meanwhile, we developed a reliable misjoin correction pipeline based on Hi-C data. Then, by comparative analysis with the other two reported assemblies of B. oleracea (TO1000 and HDEM), we identified extensive gene order and gene structural variations. We also investigated LTR-RTs in the three genomes and found that the subspecies genome-specific amplification of Gypsy-like LTR-RTs occurred less than 1 (mostly 0.4) million years ago (MYA). In particular, the athila, tat, and Del families were highly amplified in JZS in the last 1 million years. We further found that gene modifications had occurred by the genome-specific insertion of LTR-RTs, and these modified genes were enriched in cells, cell parts and endomembrane system functions. Our analyses unveiled extensive gene structural variations among B. oleracea subspecies genomes and showed that the genome-specific LTR-RT dynamics were associated with genome diversification in B. oleracea.

Results

Genome sequencing and assembly

We sequenced and de novo assembled the JZS genome through a combination of three technologies (Illumina, PacBio and Hi-C). We used 38.34 Gb (~ 61 ×) single-molecule real-time (SMRT) sequencing reads and 53.16 Gb (~ 84 ×) paired-end sequencing reads (Table S1) to assemble the JZS genome, resulting in a 561 Mb assembly with a contig N50 size of 2.37 Mb (Table 1). Compared to the JZS v1 assembly, we assembled an additional 83.16 Mb of genomic sequences (Table 1).
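As a reminder of how these headline statistics are defined, the following minimal sketch computes an N50 from a list of sequence lengths and converts sequencing yield into fold coverage. The function names and the toy lengths are illustrative only, and the ~0.63 Gb genome size is an assumption back-calculated from 38.34 Gb ≈ 61 ×; it is not a value stated in this section.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("empty list of lengths")


def fold_coverage(yield_gb, genome_size_gb):
    """Sequencing depth = total bases sequenced / genome size."""
    return yield_gb / genome_size_gb


# Toy contig lengths (arbitrary units), not real JZS data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Assumed genome size (Gb), inferred from the depths quoted in the text.
assumed_genome_gb = 0.63
pacbio_depth = fold_coverage(38.34, assumed_genome_gb)
illumina_depth = fold_coverage(53.16, assumed_genome_gb)
```

Under that assumed genome size, the two yields correspond to depths that round to the ~ 61 × and ~ 84 × quoted above.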

Table 1 Assembly statistics of JZS v2 and other B. oleracea assemblies

We developed a reliable pipeline to automatically correct misjoins in the original scaffolds. For construction of pseudomolecules, we first used 3D-DNA to correct misjoins in the original scaffolds. The analysis yielded a total of 1,018 breakpoints, of which 137 were inside the gene bodies. To improve integrity of predicted genes and reduce the number of false positive breakpoints, we developed a more reliable error correction pipeline that did not rely on the 3D-DNA misjoin correction module. This pipeline would not fragment the previously annotated gene sequences. After correction, we detected 70 reliable breakpoints, and the assembly contained 1,184 contigs with a contig N50 size of 2.37 Mb (Table S2). The results showed that error correction by the newly developed pipeline significantly improved the construction of pseudomolecules (Fig. S2).
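The count of breakpoints falling inside gene bodies can be obtained with a simple interval lookup. The sketch below is one possible way to do this and is not the code used in this study; the function name, the coordinate convention (a breakpoint is "inside" a gene only if it lies strictly between the gene start and end), and the toy coordinates are assumptions.

```python
from bisect import bisect_right


def breakpoints_in_genes(breakpoints, genes):
    """Return the breakpoints that fall strictly inside a gene body.

    breakpoints: iterable of integer positions on one scaffold.
    genes: list of (start, end) tuples on the same scaffold,
           assumed not to overlap each other.
    """
    genes = sorted(genes)
    starts = [start for start, _ in genes]
    hits = []
    for pos in breakpoints:
        # Index of the last gene that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and genes[i][0] < pos < genes[i][1]:
            hits.append(pos)
    return hits


# Toy example with two genes and five candidate breakpoints.
toy_genes = [(100, 500), (900, 1500)]
toy_hits = breakpoints_in_genes([50, 300, 700, 1200, 1500], toy_genes)

# Share of 3D-DNA breakpoints inside gene bodies, from the counts above.
share_in_genes = round(100 * 137 / 1018, 1)
```

With the counts reported above, roughly 13.5% of the 3D-DNA breakpoints would split an annotated gene.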

The quality of the JZS v2 genome was evaluated using four methods. First, 96.17% of the assembled sequences were anchored to nine chromosomes, which significantly improved the ratio of 74.71% in the JZS v1 assembly (Table 1). Second, approximately 97.1% of embryophyte genes were detected in the JZS v2 assembly by BUSCO (Waterhouse et al. 2018), similar to those in other published B. oleracea genomes (97.2% for JZS v1, 97.0% for TO1000, and 96.9% for HDEM) (Table S3), indicating the near-complete genome of JZS v2. Third, we further used LTR Assembly Index (Ou et al. 2018) to evaluate the genome continuity. The LAI value of JZS v2 was increased from 4.13 for JZS v1 to 10.13, indicating the high quality of repeat sequences. Fourth, comparison of the high-quality genome assembly of HDEM and whole genome Hi-C contact map indicated that large mis-assemblies existing in the JZS v1 assembly were extensively corrected in JZS v2 assembly (Fig. 1, Fig. S3).

Fig. 1

Comparison between assemblies of JZS v1 and JZS v2. a Dot-plot alignment of JZS v2 with JZS v1. b, c Whole genome contacts of Hi-C data of the JZS v1 (b) and JZS v2 (c) genomes

Genome annotation

In the JZS v2 genome, 48.06% (269.66 Mb) of the assembly sequences were annotated as repetitive elements, approximately 47 Mb longer than that in the JZS v1 assembly (Fig. 2, Table S4). The most abundant repetitive sequence type was the LTR-RT. As a predominant group of plant TEs, LTR-RTs include two main superfamilies, Gypsy-like and Copia-like LTR-RTs. The Gypsy-like and Copia-like LTR-RTs represented approximately 11.31% (63.47 Mb) and 10.13% (56.83 Mb) of the JZS v2 genome sequences, respectively, and were 21 Mb and 18.83 Mb longer than the corresponding sequences in the JZS v1 genome. Compared to the other two B. oleracea subspecies assemblies, there were differences in the composition of TEs (the repeat sequences of JZS v2, HDEM and TO1000 assemblies were 269.66 Mb, 255.31 Mb and 189.37 Mb, respectively) (Table S4).

Fig. 2

Genomic landscape between B. oleracea JZS v1 and JZS v2 assemblies. a, b Gene and transposable-element density in sliding windows of 500 kb and step size 50 kb. c Syntenic gene pairs between JZS v1 and JZS v2 assemblies

In total, 59,064 protein-coding genes were predicted in the JZS v2 genome by using a pipeline combining ab initio, protein-homology-based and RNA-seq-based gene prediction (Table 2). Approximately 97.5% of the embryophyte genes were detected in the JZS v2 gene dataset according to BUSCO, and the ratio was higher than in the previously reported B. oleracea annotated gene datasets (Table S5). The JZS v2 annotation contained 13,306 more genes than the JZS v1 annotated dataset, and the extra genes in JZS v2 were mainly WGD genes (Table S6). Then, we aligned gene sequences (including introns) in JZS v1 to the JZS v2 assembly; the results indicated that more than 95% of the gene sequences in JZS v1 were well aligned to the JZS v2 assembly (identity ≥ 0.95; coverage ≥ 0.95). Similarly, nearly 90% of the gene sequences in JZS v2 were well aligned to the JZS v1 assembly (Table S7). Further analysis revealed that nearly 90% of the annotated genes in the JZS v1 genome were syntenic with the JZS v2 genome, and nearly 80% of the non-syntenic genes in JZS v2 were supported by homologous evidence in other Brassicaceae species (Table S8 and S9), indicating the reliability of our annotation. In addition, the number of annotated genes in the JZS genome was lower than that of the HDEM genome. This might be due to the inclusion of some low-quality annotated genes in published B. oleracea genomes, such as genes without start codons, genes ending without a stop codon, and genes shorter than 50 bp. In the JZS v2, JZS v1, TO1000, and HDEM genomes, the low-quality genes accounted for 0% (0 of 59,064), 0.91% (416 of 45,758), 9.28% (5,498 of 59,225), and 17.87% (10,950 of 61,279), respectively (Table S10).

Table 2 Statistics of predicted genes among B. oleracea assemblies

Construction of JZS v2 sub-genomes and genome blocks

We reconstructed three sub-genomes from JZS v2 using the gene syntenic relationships to Arabidopsis thaliana (Table S11). There were 54,120 annotated genes in the three sub-genomes, and the LF sub-genome maintained more gene copies than the other two sub-genomes (Fig. S4). We calculated the syntenic relationship of JZS v2 and JZS v1 with A. thaliana; the analysis showed that the continuity of the syntenic fragments in JZS v2 was higher than that of JZS v1, especially on chromosomes C04, C05, and C06 (Fig. S5). This indicated that JZS v2 could provide more complete sub-genomes and genome blocks. Then, we defined genome blocks in the JZS v2 genome based on the sub-genome information (Fig. S6, Table S12). In addition, we investigated the location of the centromeres of the nine chromosomes in the JZS v2 genome. In total, we detected 16.17 Mb of centromere sequences on the nine chromosomes, and gene density in these regions was 29.00 genes per Mb, which was much lower than the gene density of the whole genome (105.25 genes per Mb) (Table S13).

Extensive gene order and gene structural variations among B. oleracea genomes

Extensive gene order and gene structural variation was detected among the three B. oleracea genomes. On the basis of the coding sequences of the 15,422 single-copy orthologous genes, we constructed the phylogeny for the three genomes with B. rapa as the outgroup (Fig. 3a). Then, we calculated syntenic genes of the three genomes. Approximately 83.97% (49,595 of 59,064), 83.44% (51,134 of 61,279) and 82.96% (49,132 of 59,225) of the genes were syntenic in the JZS v2, HDEM and TO1000 assemblies, respectively (Table S14). Then, we calculated syntenic genes among JZS v2, TO1000, HDEM and A. thaliana; 2,272 genes were lost in the JZS v2 genome but present in the HDEM or TO1000 genomes. Meanwhile, 1,983 and 2,874 syntenic genes were lost in the HDEM and TO1000 genomes, respectively (Fig. 3b, Table S15). Furthermore, we investigated variations within genes between JZS v2 and the other two B. oleracea genomes; approximately 20.18% and 17.83% of the genes contained large-effect mutations such as gain or loss of stop codons, splice-donor or splice-acceptor or other major protein difference variations in HDEM and TO1000, respectively (Fig. 3c, Table S16). We also calculated the similarity of syntenic genes for each pair of the three genomes. The results showed that 27.94%, 27.57%, and 18.93% of the gene pairs in JZS v2 with HDEM, JZS v2 with TO1000, and HDEM with TO1000 genomes, respectively, had considerable structural variation (similarity of the paired protein sequences of syntenic genes less than 90%) (Fig. S7). Presence/absence variations (PAVs) describe sequences that are present in some genomes but absent in others (Springer et al. 2009); we used a previously reported method (Sun et al. 2018) to detect PAV sequences in each of the three B. oleracea subspecies genomes. We identified 5,270 JZS v2 specific genomic segments (5.00 Mb in total), 6,438 HDEM specific genomic segments (7.11 Mb in total), and 5,307 TO1000 specific genomic segments (4.78 Mb in total). Most of the PAV segments were very short (0–1 kb), and very few PAV segments were longer than 5 kb (Fig. 3d and Table S17).

Fig. 3

Gene structural variations among JZS v2, TO1000, and HDEM genomes. a Phylogeny for JZS v2, TO1000, and HDEM genomes. The phylogenetic tree was constructed on the basis of 15,422 single-copy orthologous genes with Chiifu as outgroup. b Comparisons of syntenic genes among JZS v2, TO1000 and HDEM genomes. To ensure that the detected syntenic genes were reliable, A. thaliana was also included to calculate synteny. c Variations within genes between JZS v2 and two other B. oleracea genomes. d Length distribution of PAV segments between JZS v2, HDEM, and TO1000 genomes

Different changing patterns of LTR-RT among B. oleracea genomes

We detected 9,755 (62.82 Mb in total), 2,648 (14.44 Mb in total), 1,706 (8.91 Mb in total), and 7,149 (47.84 Mb in total) intact LTR-RTs in JZS v2, JZS v1, TO1000, and HDEM genomes, respectively (Table S18). Compared with JZS v1, the 7,107 (48.38 Mb in total) extra intact LTR-RTs were specifically assembled in JZS v2 (Table S18, Fig. S8). We calculated the insertion times of all of the intact LTR-RTs in the JZS v2 genome and found a new round of LTR-RT burst event in the new assembly, mainly due to the recent (0–1 MYA) large-scale expansion of LTR-RTs. Approximately 52% of JZS v2 intact LTR-RTs were formed around 0–1 MYA, whereas only approximately 30% of intact LTR-RTs in the JZS v1 genome corresponded to this time (Fig. 4).

Fig. 4

Insertion time of all intact LTR-RTs in JZS v2 and JZS v1 assemblies

The changing patterns of LTR-RT around 0–1 (mostly 0–0.4) MYA were different between the JZS v2 and HDEM genomes (Fig. 5). Among the detected intact LTR-RTs, the number of those formed during 0–1 and 0–0.4 MYA in the JZS v2 genome was 2.20 and 3.78 times that in the HDEM genome, respectively (5,075 and 2,310 LTR-RTs in JZS v2 and HDEM genomes during 0–1 MYA, 2,948 and 780 LTR-RTs in JZS v2 and HDEM genomes during 0–0.4 MYA). The rate of Gypsy-like LTR-RT amplification in the JZS v2 genome displayed a pattern of continuous increase that was absent in HDEM (Fig. 5). During the last one million years, the number of Gypsy-like LTR-RTs in JZS v2 was 2.13 times that of the HDEM genome; however, it was 4.12 times during 0–0.4 MYA. Furthermore, the recent (0–1 MYA) amplification patterns of the athila, tat, and Del families, members of the Gypsy superfamily, were consistent with the amplification patterns of Gypsy-like LTR-RTs in the JZS v2 and HDEM genomes. In the JZS genome, the numbers of athila, tat and Del elements amplified in the last 1 million years were 3.58, 5.70, and 4.08 times those of the HDEM genome, respectively (Fig. 5, Table S19, S20). These results revealed that the striking difference in the recent Gypsy-like LTR-RT expansion might contribute to B. oleracea genome diversification.
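The fold differences quoted for all intact LTR-RTs follow directly from the counts given in parentheses. The short check below reproduces them; the dictionary layout and variable names are illustrative only.

```python
# Intact LTR-RT counts per insertion-time window, as quoted in the text.
counts = {
    "0-1 MYA": {"JZS v2": 5075, "HDEM": 2310},
    "0-0.4 MYA": {"JZS v2": 2948, "HDEM": 780},
}

# Ratio of JZS v2 to HDEM counts in each window, rounded to two decimals.
fold = {
    window: round(c["JZS v2"] / c["HDEM"], 2)
    for window, c in counts.items()
}
```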

Fig. 5

Insertion time of LTR-RT in JZS v2 and HDEM genomes. The left and right panels depict the predicted insertion time of LTR-RT (all, Gypsy-like and Copia-like LTR-RTs) and three specific families (athila, tat, del) of the Gypsy superfamily

The insertion of genome-specific LTR-RT was related to syntenic gene modification

The insertion of genome-specific LTR-RTs was closely related to syntenic gene modification, which might have contributed to B. oleracea genome diversification. In this work, syntenic gene modification referred to a syntenic gene that was modified due to the insertion of genome-specific LTR-RTs (Fig. 6a, Table S21). In total, we detected the insertion of 79, 38, and 26 genome-specific LTR-RTs in the JZS v2, HDEM, and TO1000 genomes, respectively, that were closely related to 127 modified syntenic genes in the three genomes (Table S22). First, we identified syntenic gene pairs between A. thaliana and JZS v2, TO1000 and HDEM genomes and obtained a syntenic gene list. It was found that 2,272, 1,983 and 2,874 syntenic genes could not be detected in the JZS v2, HDEM, and TO1000 genomes, respectively (Fig. 3b, Table S13). Interestingly, we found that the insertion of genome-specific LTR-RTs was associated with these undetectable genes. In the JZS v2, HDEM and TO1000 genomes, the insertions of 79, 38, and 26 genome-specific LTR-RTs were associated with 71, 38 and 23 modified syntenic genes, and 67 of 71, 34 of 38 and 19 of 23 modified genes were genome-specific modified syntenic genes, respectively (Fig. S9, Table S22). To investigate the functions of the modified genes, we used the corresponding syntenic gene in A. thaliana to represent each modified gene. According to the GO annotation of these representatives, 125 of the 127 modified genes had 158 GO terms; 34.66% (44 of 127), 33.86% (43 of 127), and 26.77% (34 of 127) of the genes had GO:0008150, GO:0003674 and GO:0005575, respectively (Table S23). GO enrichment revealed that these modified genes were mainly related to cells, cell parts, and endomembrane system functions (P value < 0.01) (Fig. 6b). These results indicated that the insertion of genome-specific LTR-RTs could be closely related to syntenic gene modifications that might contribute to genome diversification in B. oleracea.

Fig. 6

The insertion of genome-specific LTR-RT related to gene modification. a An example of the insertion of a specific Gypsy-like LTR-RT related to gene modification in the JZS genome. b GO enrichment analysis of all modified genes in JZS v2, TO1000, and HDEM genomes. We used the syntenic genes in A. thaliana as representatives to investigate GO terms of modified genes in the three genomes

Discussion

B. oleracea is one of the diploid species in the famous “Brassica U’s triangle” (Nagaharu, 1935) and includes many different subspecies/morphotypes: cabbage, kale, cauliflower, broccoli, kailan, Brussels sprouts, and kohlrabi (Cheng et al. 2016). As a reference genome for the heading type B. oleracea (cabbage), JZS has been widely used for genomic analysis (Liu et al. 2014). Here we reported a greatly improved assembly of the JZS genome, taking advantage of our newly developed misjoin correction pipeline in the process of anchoring scaffolds to chromosomes by Hi-C data. Although the whole genome Hi-C contact map indicated the high quality of our assembly, a small region on C05 did not show clear contiguity. Further analysis revealed that this was mainly caused by the centromeric region (Fig. 1, Table S12). The high-quality reference genome of the heading type B. oleracea provided a more complete gene set and TE sequences, thus enabling comparisons of genome structures and making it possible to investigate diversification among subspecies genomes.

The large phenotypic differences among the seven subspecies make it interesting to investigate diversification within B. oleracea. It was reported that large genomic sequence structure variations and genetic variations were associated with subspecies diversification. In the rice genome, an AA-genome-specific inversion (~ 170 kb, bearing 14 orthologous genes) took place ~ 4.4 million years after the split with BB-genome species (Stein et al. 2018). In the maize genome, more than 10% of the annotated genes were nonsyntenic and more than 20% of the predicted genes had either large-effect mutations or large structural variations between B73 and Mo17 genomes (Sun et al. 2018). In our study, we also detected large genome-specific structure variation and extensive gene structure variation among the three subspecies. We detected three TO1000 specific inversions (on C01, C05, and C09) (Fig. S10). Despite the high quality of the TO1000 genome assembly, we still suspected that it might contain assembly errors. However, we had no other evidence to support the reliability of the TO1000 assembly (e.g., Hi-C data or BioNano data). In addition, approximately 17% of the annotated genes were nonsyntenic in each subspecies of B. oleracea, a value significantly higher than in other species (10.66% and 3.60% between two subspecies of maize and rice) (Sun et al. 2018). Although we also identified PAV sequences of the three B. oleracea genomes, we were unable to verify them experimentally, which made it impossible to calculate the error rate in calling PAVs.

LTR-RT compositions are often observed to differ among different subspecies within a species, thus supporting their importance in the formation of subspecies (Ammiraju et al. 2007; Hawkins et al. 2006; Vitte et al. 2007). In Capsicum genomes, the amplification of athila LTR-RTs, members of the Gypsy superfamily, led to genome expansion in C. baccatum (Kim et al. 2017). In Oryza species, lineage-specific massive LTR-RT bursts were detected in very recently diverged AA-genome Oryza species (Zhang and Gao 2017), and it was also reported that several lineage-specific transpositional bursts occurred in 13 domesticated and wild rice relatives (Stein et al. 2018). These results revealed that LTR-RTs were important drivers of speciation and diversification. In Brassicas, LTR-RTs played an important role in genome evolution. It was reported that a large proportion of LTR-RTs and multiple rounds of LTR-RT bursts occurred in Brassica species, as well as uneven distribution and formation of LTR-RT hotspots on chromosomes (Cai et al. 2018; Yang et al. 2016; Zhang et al. 2018). However, the dynamics of subspecies-specific LTR-RT amplification are largely unknown in Brassicas. In the present study, we found genome-specific amplification of Gypsy-like LTR-RTs around 0–1 million years ago (MYA). In particular, the athila, tat, and Del families were extensively amplified in JZS during the last 1 million years. These results reinforced the notion that the striking difference in the distribution of recent Gypsy-like LTR-RTs and the varied amplification patterns of the three specific families were closely related to B. oleracea genome diversification.

The insertion of genome-specific LTR-RTs was closely associated with gene function. In cucumber, the glabrous mutation was controlled by a single recessive locus csgl3, and the loss-of-function of CsGL3 in the mutant was due to the insertion of an LTR-RT in the 4th exon of CsGL3 (Pan et al. 2015). In B. rapa, it was reported that a transposon insertion in the second intron of BrFT2 (BrFT2 was involved in flowering time regulation in B. rapa) was detected in one of the recombinant inbred line (RIL) parental lines; the BrFT2 transcript was not present in the parental line that harbored the mutated allele, and RILs carrying only the mutated BrFT2 allele showed delayed flowering (Zhang et al. 2015). In the bread wheat genome, enrichment of TE families in gene promoters was reported, and this was conserved between the A, B, and D subgenomes (Wicker et al. 2018), and it also has been reported that TEs were directly related to gene duplication and specific gene family expansion (Hoen et al. 2006; Kong et al. 2007). These results revealed the important relationship between the insertions of LTR-RTs and gene functions. In this work, we found that the insertions of subspecies-specific LTR-RTs were related to gene modifications. In total, we found 143 LTR-RTs that were closely associated with 127 modified genes. We strongly believe that there were more modified genes associated with the insertions of LTR-RTs, since only genes that were colinear with A. thaliana were used for detecting modified genes. However, we still could not explain how the genome-specific insertion of LTR-RTs induced the gene modifications that may have contributed to genome diversification. LTR-RTs not only change the genomic structures through rapid self-replication to fuel the rapid turnover of intergenic regions but may also induce gene modifications to direct changes in the functions of genes, thus driving genome diversification.

Materials and methods

Sample preparation and genome sequencing

B. oleracea ssp. capitata homozygous line JZS (heading type) was used for sequencing and de novo assembly (Liu et al. 2014). High-quality genomic DNA was extracted from leaf tissues using a modified cetyltrimethylammonium bromide (CTAB) method (Allen et al. 2006), and then, the genomic DNA was used for Illumina and PacBio library construction and sequencing. Libraries with an insert size of 20 kb for SMRT PacBio genome sequencing were constructed as previously reported (Pendleton et al. 2015), and these PacBio libraries were sequenced on the PacBio Sequel platform (Pacific Biosciences). Libraries for Illumina paired-end genome sequencing were constructed according to the standard manufacturer’s protocol (Illumina). Illumina reads were generated from three paired-end sequencing libraries with insert sizes of 250 bp, 350 bp, and 500 bp, and these three libraries were sequenced on an Illumina platform with a paired-end sequencing strategy. The Hi-C libraries of JZS were constructed following the pipelines described in a previous study (Grob et al. 2014), and the resulting libraries were sequenced on an Illumina HiSeq 4000 device with 2 × 125 bp reads.

De novo assembly of PacBio and Illumina reads

A hybrid assembly strategy was used to complete the assembly of the JZS draft genome. Approximately 38 Gb (~ 61 ×) PacBio SMRT reads and 53 Gb (~ 84 ×) Illumina reads were used for scaffold assembly with MaSuRCA (Zimin et al. 2017). As recommended by the software developer, we used the raw Illumina and PacBio reads and the default parameters to hybrid assemble the JZS draft genome. This procedure resulted in a total assembly length of 561.11 Mb with an N50 length of 3.05 Mb (Table S2), and then BUSCO (Waterhouse et al. 2018) was used to perform a preliminary assessment of the assembly results.

Correction of misjoins in scaffolds

To detect misjoins in the hybrid assembled scaffolds, we developed a reliable misjoins correction pipeline (named MisjoinDetect) based on the Hi-C data. Our pipeline included the following three main steps. First, detection of regions of candidate misjoins. fastp (Chen et al. 2018) was used to filter low-quality Hi-C reads, and then, clean reads were mapped onto the initial assembled scaffold sequences by HiC-Pro (Servant et al. 2015). Meanwhile, scaffolds were divided into different segments according to the fixed bin size, and the interaction values between all of the fragments within each scaffold were extracted and used to form an interaction matrix. The regions of the candidate misjoins were defined based on the difference in the interaction values of adjacent bins. Second, we determined the locations of the breakpoints. The program first searched for gap information in the candidate area. If the gap existed, the gap area would be deleted, and the location would be defined as a breakpoint. (These errors were caused by different contigs being incorrectly connected by the de novo assembly software.) If it did not exist, the program would use the midpoint of the two genes in the middle of the candidate region as a potential breakpoint to ensure that the gene sequence was intact. Moreover, we provided an additional script that relied on a collinear list of genes from the related species. Based on the collinearity results, we could more accurately determine the location of the breakpoint to ensure a more complete syntenic region. Finally, clean scaffolds sequences and an Hi-C contact map were obtained for each corrected scaffold.
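The first two steps can be illustrated with a small sketch. This is not the MisjoinDetect code; it is a simplified reading of the description above under stated assumptions: candidate regions are flagged where the contact value between two adjacent bins drops below a fraction of the scaffold-wide median (the `min_ratio` threshold is hypothetical), genes are assumed not to overlap, and all function names and toy values are illustrative.

```python
from statistics import median


def candidate_boundaries(matrix, min_ratio=0.2):
    """Flag boundaries between adjacent bins with unusually low contact.

    matrix: square list-of-lists of Hi-C contact values between bins
            of one scaffold.
    min_ratio: hypothetical cutoff relative to the median adjacent-bin
               contact value (not a parameter taken from the paper).
    Returns indices i where the boundary between bin i and bin i + 1
    is a candidate misjoin.
    """
    adjacent = [matrix[i][i + 1] for i in range(len(matrix) - 1)]
    if not adjacent:
        return []
    cutoff = min_ratio * median(adjacent)
    return [i for i, value in enumerate(adjacent) if value < cutoff]


def choose_breakpoint(region, gaps, genes):
    """Pick a breakpoint inside a candidate region (start, end).

    gaps: list of (start, end) assembly gaps on the scaffold.
    genes: list of (start, end) non-overlapping genes on the scaffold.
    An assembly gap inside the region is preferred. Otherwise the
    midpoint of the intergenic interval closest to the region centre
    is used, so that no annotated gene is split.
    Returns None if neither exists within the region.
    """
    lo, hi = region
    for g_start, g_end in sorted(gaps):
        if g_start >= lo and g_end <= hi:
            return ("gap", g_start, g_end)
    centre = (lo + hi) // 2
    genes = sorted(genes)
    best = None
    for (_, left_end), (right_start, _) in zip(genes, genes[1:]):
        if right_start <= left_end:
            continue  # skip overlapping neighbours
        mid = (left_end + right_start) // 2
        if lo <= mid <= hi and (
            best is None or abs(mid - centre) < abs(best - centre)
        ):
            best = mid
    return ("intergenic", best) if best is not None else None


# Toy 5-bin contact matrix: contact collapses between bin 2 and bin 3.
toy_matrix = [
    [0, 50, 20, 1, 0],
    [50, 0, 48, 2, 1],
    [20, 48, 0, 3, 2],
    [1, 2, 3, 0, 52],
    [0, 1, 2, 52, 0],
]
flagged = candidate_boundaries(toy_matrix)

with_gap = choose_breakpoint((2000, 4000), [(2900, 3000)], [])
without_gap = choose_breakpoint(
    (2000, 4000), [], [(1500, 2600), (3200, 3900)]
)
```

In the toy example the gap-containing region is split at the gap, while the gap-free region is split midway between the two flanking genes, leaving both gene models intact.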

Construction of pseudomolecules and evaluation

The highly efficient pipelines developed by Aiden Lab (https://aidenlab.org) were used to anchor the 02–12 clean scaffolds onto the 9 chromosomes. Juicer (Durand et al. 2016) was used to align clean Hi-C reads to corrected scaffolds, and then, 3D-DNA (Dudchenko et al. 2017) was used to anchor corrected scaffolds onto chromosomes (-m haploid -e). Finally, Juicebox (Robinson et al. 2018) was used to visualize linked results by 3D-DNA, and we manually determined chromosome boundaries and some small errors. D-GENIES (Cabanettes and Klopp 2018) was used to perform synteny alignments between JZS assembly and HDEM. BUSCO (Waterhouse et al. 2018) was further used to evaluate the genome-assembly completeness, and 1440 single-copy orthologous genes were used as a dataset. Meanwhile, LAI index (Ou et al. 2018) was also used to evaluate the continuity of the assembly.

Gene prediction and function annotation

We used RepeatMasker (Tarailo-Graovac and Chen 2009) to mask the whole genome sequences, and then, gene prediction was based on the masked genomic sequences. The gene prediction process consisted of the following four steps. First, extraction of ab initio gene models. AUGUSTUS (https://github.com/Gaius-Augustus/Augustus) and GeneMark (Besemer and Borodovsky 2005) were used for de novo gene prediction. Second, GeneWise (Birney et al. 2004) with default parameters was used to detect homologous gene models. Third, detection of gene models supported by RNA-seq data. All of the transcriptome data were downloaded from NCBI (SRS472277, SRS472450, GSM1052958, GSM1052959, GSM1052960, GSM1052961, GSM1052962, GSM1052963, GSM1052964). Then, Trinity (Grabherr et al. 2011) and PASA (Haas et al. 2003) were used to predict genes. Finally, EVidenceModeler (Haas et al. 2008) was used to merge all of the gene model predictions. In our study, we filtered out all low-quality gene models. Low-quality annotated genes included genes without start codons, genes ending without a stop codon, and genes shorter than 50 bp. InterProScan (Hunter et al. 2009) was used to annotate motifs and domains, and we extracted gene ontology terms from the output of InterProScan. All of these annotation datasets are freely available from the BRAD database (https://brassicadb.org/brad/datasets/pub/Genomes/Brassica_oleracea/V2.0/).
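The three low-quality criteria can be expressed as a simple predicate. The sketch below is illustrative and not the filtering script used in this study; it assumes the criteria are applied to the coding sequence of each gene model, that the start codon is ATG, and that the 50 bp threshold refers to the coding sequence length. The function name is hypothetical.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_low_quality(cds, min_length=50):
    """Return True if a coding sequence fails any of the three criteria:
    shorter than min_length, no start codon, or no terminal stop codon."""
    cds = cds.upper()
    return (
        len(cds) < min_length
        or not cds.startswith(START_CODON)
        or cds[-3:] not in STOP_CODONS
    )


# Toy sequences, not real gene models.
complete = "ATG" + "GCT" * 20 + "TAA"   # 66 bp, start and stop present
no_stop = "ATG" + "GCT" * 20            # lacks a stop codon
no_start = "GCT" * 20 + "TAA"           # lacks a start codon
too_short = "ATGTAA"                    # only 6 bp

kept = [s for s in (complete, no_stop, no_start, too_short)
        if not is_low_quality(s)]
```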

Annotation of transposable elements and identification of LTR-RTs associated with syntenic gene modification

EDTA package (Ou et al. 2019) was used to construct a non-redundant TE library, and then, transposable elements (TE) were annotated and classified using RepeatMasker (Tarailo-Graovac and Chen 2009) with default parameters. Intact LTR-RTs in JZS v2, HDEM and TO1000 were identified using LTR_Finder (Xu and Wang 2007) with the parameters ‘-D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.9’, and LTR_retriever (Ou and Jiang 2018) was further used to categorize the detected LTR-RTs into the subgroups of Copia-like and Gypsy-like LTR-RTs. The insertion time of the intact LTR-RT was extracted from outputs of LTR_retriever (the base substitution rate 1.3 × 10⁻⁸ was adopted in our work).
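For orientation, insertion times of intact LTR-RTs are commonly estimated from the divergence between the two LTRs of an element as T = K / (2r), where K is the corrected distance and r the substitution rate. The sketch below illustrates that calculation with the rate quoted above; the use of the Jukes-Cantor correction is an assumption made here for illustration, since the distance model applied by LTR_retriever is not restated in this text, and the function names are hypothetical.

```python
import math

# Substitution rate quoted in the text (assumed per site per year).
SUBSTITUTION_RATE = 1.3e-8


def jukes_cantor_distance(p):
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time_mya(p, rate=SUBSTITUTION_RATE):
    """Estimate insertion time in million years: T = K / (2r).

    The factor 2 reflects that both LTRs of an element accumulate
    substitutions independently after insertion.
    """
    k = jukes_cantor_distance(p)
    return k / (2 * rate) / 1e6


# Two LTRs differing at 1% of sites.
t_one_percent = insertion_time_mya(0.01)
```

Under these assumptions, an element whose two LTRs differ at 1% of sites would be dated to a little under 0.4 MYA, which places it within the recent window discussed in the Results.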

First, we calculated syntenic gene pairs between A. thaliana and JZS v2, TO1000 and HDEM genomes and obtained a syntenic gene list. Then, we identified whether the LTR-RT sequence was included in the region where the corresponding syntenic gene was not detected. If the LTR-RT could be detected, we extracted the upstream and downstream syntenic genes and manually checked.
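The screening step before manual inspection can be sketched as an interval query: given the syntenic genes flanking the position where a gene is missing, list the intact LTR-RTs that lie between them. The function name, tuple layout, and toy coordinates below are assumptions for illustration and do not represent the scripts used in this study.

```python
def ltr_rts_in_interval(upstream_gene, downstream_gene, ltr_rts):
    """Return intact LTR-RTs lying between two flanking syntenic genes.

    upstream_gene, downstream_gene: (start, end) on the same chromosome,
        given in either order.
    ltr_rts: list of (start, end, name) on that chromosome.
    """
    # The interval runs from the end of the left gene to the start
    # of the right gene, regardless of the order of the arguments.
    lo = min(upstream_gene[1], downstream_gene[1])
    hi = max(upstream_gene[0], downstream_gene[0])
    return [ltr for ltr in ltr_rts if ltr[0] >= lo and ltr[1] <= hi]


# Toy coordinates, not real annotations.
toy_ltrs = [
    (100, 900, "ltr_a"),
    (2500, 7500, "ltr_b"),
    (8500, 9500, "ltr_c"),
]
candidates = ltr_rts_in_interval((1000, 2000), (9000, 10000), toy_ltrs)
```

Only elements fully contained in the intergenic interval are reported; each hit would then be checked manually, as described above.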

Phylogenetic inference

First, we used OrthoFinder (Emms and Kelly 2015) to detect single-copy genes among Chiifu, TO1000, HDEM and JZS genomes (Belser et al. 2018; Parkin et al. 2014; Zhang et al. 2018). The coding sequences (CDS) of 15,422 single-copy gene families within the four genomes were aligned at the nucleotide level using MAFFT (Katoh et al. 2005), and well-aligned regions were extracted using Gblocks (v0.91b) (Talavera and Castresana 2007) with -t=p, -b4=5, -b5=h. Finally, we used RAxML (Stamatakis 2014) with PROTGAMMAWAG model and 500 bootstrap replicates to construct the tree.

Genome blocks and centromere detection in the JZS genome

SynOrths (Cheng et al. 2012b) was used to perform synteny analysis in this work. We used syntenic gene pairs between JZS v2 and A. thaliana to construct three sub-genomes and defined GBs in the JZS v2 genome. First, we conducted syntenic analysis between JZS v2 and A. thaliana, and then, the least fractionated (LF), the medium fractionated (MF1) and the most fractionated (MF2) subgenomes of JZS v2 were built by previously reported methods (Cheng et al. 2012a). Then, using the method of defining the genomic blocks (GBs) in B. rapa (Zhang et al. 2018), we defined GBs in the JZS v2 genome. Centromere regions in each chromosome were defined by using mummer (Kurtz et al. 2004) to map centromeric repeat sequences (CentBr, CRB, TR238, and PCRBr) (Koo et al. 2011; Lim et al. 2007) to genome sequences with parameters (--maxmatch -g 500 -c 16 -l 16).