Introduction

B chromosomes (Bs) are accessories to the normal chromosome set (As) that are present in some individuals of a large number of diverse eukaryotic species. These extra chromosomes usually do not recombine with the A chromosomes and do not follow the rules of Mendelian segregation (Jones 2018). They are mostly heterochromatic, are composed largely of repetitive DNA, are not needed for the survival or reproduction of individuals, and are maintained through a parasitic drive mechanism (Camacho 2005). Drive is a specialized process by which Bs escape elimination during the cell cycle and are transmitted at a higher rate than the normal Mendelian transmission frequency of As (Houben 2017; Jones 2018). The combination of next generation sequencing (NGS) and modern bioinformatics has provided new methods to identify B-located sequences (Ruban et al. 2017). Recent work suggests that Bs are composed of fragmented or intact sequences derived from different As and from organelle DNA (Houben et al. 2014; Valente et al. 2014; Banaei-Moghaddam et al. 2015; Ruban et al. 2017). Despite growing knowledge about the genomic content, origin, and pattern of evolution of Bs, the biological significance of these elements remains unclear. Recent studies have shown that B chromosomes carry transcriptionally active DNA sequences and also influence the transcription of other sequences in the cell, which could affect a variety of cellular functions (Carmello et al. 2017; Houben 2017; Navarro-Domínguez et al. 2017, 2019; Ramos et al. 2017; Valente et al. 2017).

Among the African cichlid fishes, B chromosomes were first described in Astatotilapia latifasciata from Lake Nawampasa, a satellite lake of the Lake Kyoga system (Poletto et al. 2010). Previous analyses of the B chromosome in A. latifasciata were based on cytogenetics and comparative genomics. Cytogenetics confirmed that both sexes of A. latifasciata can carry zero, one, or two similar B chromosomes enriched with many repetitive DNA sequences (Poletto et al. 2010; Fantinatti et al. 2011). The genomic content of A. latifasciata Bs was assessed by comparative coverage analysis of 0B and 2B individuals, followed by experimental validation by qPCR and fluorescence in situ hybridization (FISH) mapping as well as sequencing of microdissected Bs (Valente et al. 2014). These analyses detected an enrichment of transposons, thousands of degenerated genic sequences, and a few complete genes. The complete genes were found to be related to functional terms such as “cell cycle” and “chromosome-associated.” Furthermore, some of the intact genes were shown to be potentially transcriptionally active. Recent functional studies in A. latifasciata have revealed transcriptional variation related to B chromosome presence in diverse DNA sequence classes, including protein-coding genes, non-coding RNAs, and repetitive sequences (Carmello et al. 2017; Ramos et al. 2017; Valente et al. 2017; Coan and Martins 2018). However, a better understanding of gene and functional sequence organization in the genome has been hindered by the lack of an assembled genome for this species. Among the genes discovered in the B chromosomes of diverse species (for a review, see Ahmad and Martins 2019), there is a morphogenesis-related gene named Indian hedgehog b (ihhb), detected in the B chromosome of the cichlid Lithochromis rubripinnis (Yoshida et al. 2011).
Development and morphogenesis have appeared as enriched terms in gene ontology (GO) analyses of B chromosome genes described for diverse species (Ahmad and Martins 2019). The hedgehog (hh) gene family was first reported in Drosophila; it is involved in embryo polarity and encodes proteins with two types of domains that function during the embryonic development of the skeletal and nervous systems (Nüsslein-Volhard and Wieschaus 1980). The hh family of vertebrates has three genes called sonic hedgehog (shh), desert hedgehog (dhh), and Indian hedgehog (ihh), each with a different pattern of expression and playing important roles in the diversification and complexity of vertebrates (Pereira et al. 2014). It has been suggested that the paralogs of these hh genes have evolved at different rates after the whole genome duplication (WGD) events in vertebrates (Pereira et al. 2014). The presence of ihhb copies in the B chromosome of a cichlid species (Yoshida et al. 2011) opens the discussion of the possible effects of B chromosomes on important biological features.

Here, we present de novo assemblies and annotation for A. latifasciata genomes with and without B chromosomes, and characterize the genomic diversity of samples containing B chromosomes. These genome assemblies reveal important aspects of B chromosome biology. We detected extensive genomic rearrangements related to B chromosome presence and identified thousands of coding genes harbored in the B chromosome. In addition, an analysis of sequence coverage, coupled with FISH mapping, revealed a high copy number of the inactive ihhb gene, which emerges as a major structural component of the B chromosome.

Materials and methods

Chromosome preparation, DNA sampling, and genome sequencing

Astatotilapia latifasciata samples were karyotyped by classical chromosome preparation protocols to check the number of B chromosomes and the metaphasic chromosome suspensions were stored for FISH mapping. The samples were named as B− (absence of B chromosomes) or B+ (presence of 1 or 2 B chromosomes). FISH probes were obtained from genomic DNA via polymerase chain reaction (PCR) (Fantinatti and Martins 2016).

High quality genomic DNA (gDNA) samples from five karyotyped A. latifasciata specimens (males and females) with 0B and 1B chromosome were selected for next generation sequencing (NGS). We also reanalyzed two additional read datasets from Valente et al. (2014) (Table 1). The Illumina libraries were sheared to an average size of 350–550 bp using an S220 focused ultrasonicator (Covaris Inc., Woburn, USA) and prepared using the TruSeq DNA sample preparation kit ver.2 rev. C (Illumina Inc., San Diego, USA). Paired-end sequencing was performed using the Illumina HiSeq 1000 and MiSeq platforms. Read quality was checked using the FastQC software (Andrews 2010). Data filtering was performed using the FASTX toolkit (Gordon and Hannon 2010), retaining only reads with a minimum of 90% of nucleotides showing at least 30 in Phred quality score. Reads containing Illumina adapter sequences were eliminated using BLASTn search (E-value ≤ 10e−5 and 90% of identity as cutoff parameters) and customized Python programming script. Reads without a mate (singletons) were discarded by Pairfq software (https://github.com/sestaton/Pairfq). The coverage was calculated for all samples by the equation Cov = (rc × rl)/S, where rc is the read count, rl is the read length, and S the genome size. We considered A. latifasciata genome size comparable to Metriaclima zebra genome (O’Quin et al. 2013) since this is the most closely related cichlid genome to A. latifasciata. All datasets are available at Sacibase (sacibase.ibb.unesp.br).
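
The read-quality criterion and the coverage equation above can be illustrated with a short Python sketch. It is not the authors' pipeline: the function names and the example numbers are hypothetical, and the quality filter assumes that Phred scores have already been decoded to integers.

```python
def passes_quality_filter(phred_scores, min_quality=30, min_fraction=0.90):
    """Return True if at least min_fraction of bases have Phred >= min_quality."""
    if not phred_scores:
        return False
    good = sum(1 for q in phred_scores if q >= min_quality)
    return good / len(phred_scores) >= min_fraction


def coverage(read_count, read_length, genome_size):
    """Cov = (rc x rl) / S, with rc, rl, and S as defined in the text."""
    return (read_count * read_length) / genome_size


# Hypothetical numbers, chosen only to show the arithmetic.
example_cov = coverage(read_count=1_000_000, read_length=100, genome_size=10_000_000)
```

With these invented inputs, one million reads of 100 bp over a 10 Mb genome correspond to 10× coverage.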

Table 1 Illumina sequencing data obtained for A. latifasciata. M, male; F, female; 0B, sample without B chromosome; 1B, sample with 1B chromosome; 2B, sample with 2B chromosomes

Genome assembly and quality evaluation

The reads passing the pre-processing filters were sorted into two groups: B− and B+ data. They were then passed to KmerGenie (Chikhi and Medvedev 2014) to obtain the optimal k-mer values for genome assembly. The separate datasets were used to produce two de novo assemblies, the B− and B+ genomes, using the Velvet assembler (v 1.2.08) (Zerbino and Birney 2008). This assembler was recommended by the Assemblathon 2 competition (Bradnam et al. 2013). The parameters used were “-ins_length 500 -exp_cov auto” and “-unused_reads yes -read_trkg yes.” To close the assembly gaps within scaffolds, we ran the GapFiller algorithm (Boetzer and Pirovano 2012) with the parameters -m 80, -t 10, and -g 5. We computed several metrics (length, number, length variation, N50, gap length) for each assembly using the QUAST software (Gurevich et al. 2013). To evaluate the completeness of the B− and B+ assemblies, we searched for a set of 453 core eukaryotic genes using the CEGMA (version 2.4) pipeline (Parra et al. 2007). The assembled genomes are available at the Sacibase database.
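
As a reminder of what the N50 metric measures, the following Python sketch computes it from a list of scaffold lengths. It follows the usual definition (the length L such that scaffolds of length L or longer contain at least half of the assembled bases) and is not QUAST's own code; the function name is illustrative.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")
```

For example, for scaffolds of lengths 2, 2, 2, 3, 3, 4, 8, and 8 (32 bases in total), the two longest scaffolds already hold 16 bases, so the N50 is 8.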

Genome annotation

We used three approaches for gene annotation of the B− reference assembly, combined in the MAKER v2.31.8 pipeline (Cantarel et al. 2008): identity to known genes available in current databases, de novo prediction, and transcript sequence alignment. We annotated repetitive elements using a custom fish database and a general metazoan database. We used coding (CDS) and protein sequences from Danio rerio (http://www.ensembl.org/Danio_rerio) to annotate genes. We also used transcriptomes from A. latifasciata (Nakajima et al. in preparation) to inform the annotation. We used the lamprey training set available in the Augustus software (Stanke et al. 2004) for gene prediction.

The predicted genes were extracted from the assembled genome using customized Unix scripts and mapped to a D. rerio protein database retrieved from Ensembl (ftp://ftp.ensembl.org/pub/release-91/fasta/danio_rerio/pep/). The BLASTx mapping at NCBI was performed with an E-value cutoff of 1E−5, and the XML-formatted output file was imported into BLAST2GO (Götz et al. 2008). We mapped the resulting aligned proteins of the corresponding query genes against the GO database to obtain functional information. The functional annotation was processed using default parameters for all gene functions.

Comparative genomics and genome diversity analysis

Genomic rearrangements were identified by comparing the two genome assemblies (B+ and B−) using a whole genome pairwise sequence alignment approach. For this purpose, we used minimap2 (Li 2018) for mapping, and the output was plotted as dotplots using an R script called dotPlotly (https://github.com/tpoorten/dotPlotly). To better interpret the synteny patterns in the dotplots, we also performed self-alignments of B+ with B+ and of B− with B− genomes. Finally, we analyzed the rearranged blocks identified from the alignments of B+ with B− at several filtering steps of query length and mapping parameters. The mapped blocks were filtered to improve the visualization of syntenies and to explore the differences between the two genomes.

We also annotated genes and repeats that were duplicated in the B+ genome. This analysis consisted of several steps. First, we identified these regions by the number of times they were repeatedly aligned to the query sequence of the B− genome. If multiple scaffolds of the B+ assembly align to the same B− target, the segment is considered duplicated in the B+ genome. To confirm the identification of duplicated blocks, we compared the total scaffold size to its respective alignment size and calculated the duplicated copy ratio by the following equation: DR = (al × ql)/1, where DR is the duplicated copy ratio, al is the alignment block length, and ql is the total query sequence length; 1 represents a regular single copy in the genome. The DR values for the same alignment entries were then summed for each query sequence to yield a total value. A total DR greater than 2 for a specific scaffold indicates the recurrence of two copies of a given sequence. For a regular, non-duplicated sequence, we expect a DR value of 1. We extracted the duplicated blocks with customized bash scripts using a threshold of at least two repeated alignments of a similar scaffold. This extraction was done from the B+ and B− alignment file generated by minimap2 (PAF, Pairwise mApping Format). We then mapped these blocks to the reference coding sequences (CDS) and to the proteome database of zebrafish by BLAST. The blocks were also subjected to the RepeatMasker program (Smit et al. 2013-2015) to identify transposable elements. Finally, the functional annotation and GO enrichment of the blocks were done using the BLAST2GO pipeline. The enrichment of GOs was plotted using Revigo (Supek et al. 2011).
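
The duplicated-block screen described above can be sketched in Python on a minimap2 PAF file. This is an illustration under stated assumptions, not the authors' bash scripts: it assumes B+ scaffolds are the queries and B− scaffolds the targets, and it interprets the copy ratio as alignment block length divided by query length, so that a single full-length alignment scores 1, as the text expects for a non-duplicated sequence. All function names and example records are hypothetical.

```python
from collections import defaultdict


def parse_paf(paf_lines):
    """Yield (query, query_len, target, block_len) from PAF text lines."""
    for line in paf_lines:
        if not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        # PAF columns (1-based): 1 query name, 2 query length,
        # 6 target name, 11 alignment block length.
        yield f[0], int(f[1]), f[5], int(f[10])


def duplicated_targets(paf_lines, min_queries=2):
    """B- targets hit by at least min_queries distinct B+ scaffolds."""
    hits = defaultdict(set)
    for query, _qlen, target, _block in parse_paf(paf_lines):
        hits[target].add(query)
    return {t: sorted(q) for t, q in hits.items() if len(q) >= min_queries}


def summed_ratio(paf_lines):
    """Per-query sum of (alignment block length / query length).

    ASSUMPTION: the ratio is taken as al / ql so that one full-length
    alignment gives 1; this is an interpretation, not the authors' script.
    """
    totals = defaultdict(float)
    for query, qlen, _target, block in parse_paf(paf_lines):
        totals[query] += block / qlen
    return dict(totals)


# Hypothetical records, invented for illustration only.
example = [
    "bplus_1\t1000\t0\t1000\t+\tbminus_A\t5000\t0\t1000\t990\t1000\t60",
    "bplus_2\t800\t0\t800\t+\tbminus_A\t5000\t1000\t1800\t790\t800\t60",
    "bplus_1\t1000\t0\t1000\t+\tbminus_C\t4000\t0\t1000\t985\t1000\t60",
]
```

In this toy input, two B+ scaffolds hit the same B− target, and `bplus_1` aligns in full to two different targets, giving it a summed ratio of 2.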

The read datasets of A. latifasciata (males and females with 0B, 1B, and 2B chromosomes, Table 1), were aligned to our B− genome using Bowtie2 (Langmead and Salzberg 2013) with “--very-sensitive” option. Nucleotide polymorphisms were identified using SAMtools (Li et al. 2009) to search for genome variations among the samples. The output variant call format (VCF) files were passed to VCFtools (vcf-stats and vcf-compare) (Danecek et al. 2011) for statistical analysis to discover the frequency of SNPs and insertions/deletions (indels). We compared these variant calls to identify shared, unique, or B chromosome–related variations. Filtering was carried out to eliminate the lower quality (Q ≤ 20 and DP ≥ 100) SNPs and indels using vcflib (Garrison 2012). A similar approach was followed to detect variations in the genomic data aligned to de novo transcriptome assembly of A. latifasciata (Nakajima et al. in preparation).
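
The comparison of variant calls across samples amounts to set operations on variant records. The Python sketch below shows one way to express it, assuming that a B chromosome–related candidate is a variant present in every B+ sample and absent from every B− sample; that criterion, the function names, and the use of (scaffold, position, REF, ALT) as the variant key are assumptions for illustration and do not reproduce the VCFtools commands.

```python
def variant_keys(vcf_lines):
    """Collect (scaffold, position, ref, alt) keys from VCF text lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def b_candidate_variants(b_plus_sets, b_minus_sets):
    """Variants shared by all B+ samples and absent from all B- samples."""
    if not b_plus_sets:
        return set()
    shared_plus = set.intersection(*b_plus_sets)
    any_minus = set().union(*b_minus_sets)
    return shared_plus - any_minus
```

Each sample's VCF is reduced to a set of keys, so the same two helper functions apply to SNPs or indels and to any number of samples.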

Structural variations (translocations) were analyzed based on 1B sequencing data using Delly (Rausch et al. 2012). This pipeline was applied to both B+ and B− reads in BAM format (generated using Bowtie2), taking B+ as the query and the B− de novo genome assembly of A. latifasciata as the reference to locate the variations on the scaffold regions. Translocations (breakpoints) were visualized with ClicO (Cheong et al. 2015), an online web service (http://codoncloud.com:3000/home) based on Circos (Krzywinski et al. 2009).

Structural variations (SVs) such as deletions, insertions, translocations, inversions, and duplications in genomic regions related to the B chromosomes were analyzed with the inGAP-sv tool (Qi and Zhao 2011). inGAP-sv detects SVs on the basis of the pattern and coverage of mapped paired-end reads. We applied this pipeline to the 2B SAM file generated using BWA (Li and Durbin 2009). The B− assembled genome was used as the reference. After the SAM file was loaded into inGAP-sv, a mapping quality threshold (default value: 20) was applied to filter out non-uniquely mapped reads. Illustrations of paired-end mapping (PEM) patterns for different types of identified SVs were generated according to Qi and Zhao (2011).

Analysis of B-specific sequences

The nucleotide sequences of several genes previously identified on B chromosomes in vertebrates were retrieved from NCBI (Table S1). For genes with more than one sequence available, consensus sequences were built using Geneious v. 4.8.5 (Drummond et al. 2009). The final sequences were used as queries against the B− assembly in a standard BLASTn search. The number of BLAST hits, E-values, and percent identity were evaluated before proceeding further with B chromosome–related analyses. Although most of the genes were either absent or had only partial sequences in the A. latifasciata genome, the 45S rRNA and ihhb (Indian hedgehog b) genes were selected for further analysis because of their high level of integrity. The high-coverage Illumina reads of all B− (0B male and female) and B+ (1B males, 1B female, and 2B male) samples were aligned to the reference 45S rRNA and ihhb gene copies described for the cichlids Oreochromis aureus and Lithochromis rubripinnis, respectively, using the paired-end mode of Bowtie2 (very sensitive alignment option). The outputs were converted to binary format and indexed using SAMtools. Each file was normalized by RPKM (reads per kilobase per million mapped reads) using deepTools (Ramírez et al. 2014) to correct for bias in initial coverage. These files were then visualized using the Integrated Genome Browser, IGB (http://bioviz.org/), to compare the coverage of both genes in B− and B+ samples. SNPs at different sites of the reads were detected. The transcription of the 45S rRNA and ihhb genes was assessed based on the available read datasets of A. latifasciata transcriptomes (Nakajima et al. in preparation). The uploaded transcriptomic and genomic data (aligned files) were visualized and manually evaluated in Sacibase. BLASTn searches of the 45S rRNA and ihhb genes in the A. latifasciata transcriptome assembly (Table S2) (Nakajima et al. in preparation) were conducted to locate those genes in specific scaffolds.
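
The RPKM normalization mentioned above follows a simple formula, sketched here in Python. The sketch shows only the arithmetic (read count scaled by region length in kilobases and by millions of mapped reads); it does not call deepTools, and the function names and numbers are hypothetical.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


def coverage_fold(rpkm_b_plus, rpkm_b_minus):
    """Ratio of normalized coverage in a B+ sample to that in a B- sample."""
    return rpkm_b_plus / rpkm_b_minus


# Hypothetical counts for one 1-kb region in two samples of different depth.
b_minus_value = rpkm(read_count=100, region_length_bp=1000, total_mapped_reads=1_000_000)
b_plus_value = rpkm(read_count=800, region_length_bp=1000, total_mapped_reads=2_000_000)
```

Because both values are scaled by sequencing depth, the fold difference between them reflects copy number rather than the number of reads sequenced per sample.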

Primer design, probe construction, and fluorescence in situ hybridization mapping of ihhb and 45S rRNA genes

Primers were designed for the ihhb gene and the 45S rRNA cistron (including the transcribed segments for the 5.8S, 18S, and 28S rRNAs) (Table S3) using PrimerQuest (http://www.idtdna.com/primerquest/home/index). Specificity of the primers was checked using Primer-Blast (http://www.ncbi.nlm.nih.gov/tools/primer-blast/) and evaluated by Primer Stat (Stothard 2000). Genomic DNA was subjected to PCR to obtain DNA probes for FISH mapping. DNA fragments obtained by PCR were sequenced (Sanger et al. 1977) using an ABI Prism 3100 automatic DNA sequencer (Applied Biosystems, Foster City, USA) with a Dynamic Terminator Cycle Sequencing Kit (Applied Biosystems) as per the manufacturers’ instructions. The sequences obtained were subjected to BLAST (Altschul et al. 1990) searches at the NCBI to confirm if they correspond to the annotated genes. Probes were labeled with digoxigenin-11-dUTP (Roche Applied Science, Penzberg, Germany) using PCR and the signal was detected with anti-digoxigenin-rhodamine (Roche Applied Science). FISH was performed using the protocol described by Pinkel et al. (1988) with modifications (Cabral-de-Mello et al. 2012). The slides were denatured in 70% formamide/2xSSC, pH 7, for 36 s, and dehydrated in an ice-cold ethanol series (70%, 85%, and 100%). The images were captured with an Olympus DP71 digital camera coupled to a BX61 Olympus microscope (Olympus Corporation, Tokyo, Japan) and were optimized for brightness and contrast using Adobe Photoshop CS2.

Results

Assembly and gene annotation of the A. latifasciata draft genome

A total of seven B− and B+ Illumina-sequenced samples were analyzed, including six individual datasets obtained with the HiSeq platform and one with the MiSeq platform (Table 1). After filtering, a total of 513,441,660 B− reads with 80.5× coverage and 850,418,274 B+ reads with 119.9× coverage were recovered for the analysis (Table 1). The de novo assembly of the A. latifasciata B− reference genome was 771,316,069 bp long. Based on the size of the M. zebra genome, we estimate that about 84% of the A. latifasciata genome was recovered. The B− assembly yielded 218,259 scaffolds with an N50 of 18,640 bp; the longest scaffold was 233,669 bp (Table 2; Table S4). The B+ assembly was 781,068,509 bp long, in 197,652 scaffolds with an N50 of 25,546 bp. The longest scaffold was 238,637 bp (Table S5).

Table 2 Comparison of the current assembly of B− A. latifasciata to other African cichlids and other fish species assemblies (statistics data sourced from NCBI)

The proportions of complete and partial core eukaryotic genes (CEGs) recovered in the B− assembly were ~ 73% and ~ 94%, respectively (Table S6). The annotation pipeline found 24,907 genes in the B− A. latifasciata genome assembly (Fig. S1; Table S7), comprising 22.4% of the total assembly. The annotated genes were also screened for GO level distribution in three categories: biological process (P), molecular function (F), and cellular component (C). The highest number of GO terms was detected for biological process (Fig. S2), and the most abundant sub-categories were cellular process, biological regulation, and multicellular organism process. Structural annotations have been uploaded to Sacibase as a general feature format (GFF) file.

B chromosome polymorphism

We identified a total of 2,395,658 SNPs and 888,060 indels in the six genomes (B− and B+ individuals) when compared to the B− assembly (Fig. S3). After filtering, we detected a total of 17,875 high quality SNPs in all genomes (Fig. S4a), and the 1B female reads showed the highest number of SNPs (5,181) shared with all other individuals (Fig. S4b). In B+ individuals, a total of 11,978 SNPs were identified relative to the B− reference genome (Fig. S4b). The SNP frequencies of genic sequences (CDS, exons, and introns) in A. latifasciata were also screened by comparing coding sequences from the B+ and B− read datasets to the A. latifasciata transcriptome assembly (Nakajima et al. in preparation), which detected a total of 16,839 SNPs in four samples, two B+ and two B−, each pair comprising a male and a female (Fig. S4c, S4d). Male samples, irrespective of B chromosome presence, had a higher frequency of SNPs in genic sequences (10,508) than female samples (6,331). Comparative analysis of different SNP combinations from B+ and B− samples confirmed unique and shared sets of SNP variation. We identified a total of 687 genomic (Fig. S4b) and 167 transcriptomic (Fig. S4d) SNPs shared among all B+ samples, suggesting that these SNPs are located on a B chromosome. Similarly, B-specific SNPs were also identified in the B-located genes (see “Sequence analysis and physical mapping of B sequences”).

Whole genome rearrangements and structural variations

The whole genome comparative analysis of the B+ and B− assemblies using pairwise minimap2 alignments generated a total of 849,600 alignments. The conserved syntenies between the two genomes were detected under a significance threshold of E-value < 1E−50, and the alignment results were visualized as dotplots. The expected high proportion of homologous sequences is apparent as diagonal lines of synteny (Fig. 1a), indicating highly conserved content between the two genomes derived from the same species. However, we also observed breaks in synteny, duplicated blocks including those from the ancient WGD, and several insertions, deletions, and inversions, which signal genome rearrangements. Many duplicated regions in the B+ genome were visible in the dotplot comparison (Fig. 1b). A total of 1,717 duplicated blocks were identified and retrieved from the B+ genome. Gene annotation of these blocks detected 1,546 protein-coding sequences, 8 processed pseudogenes, and 3 unknown genes (Dataset S1). Selected alignments with Phred score ≥ 30 (Table S8) were used to determine the number of indels in the whole genome alignments, which showed few large indels between the B+ and B− genomes (Fig. 2a). The total length of indels between the two genomes is 21,505,536 bp, which is ~ 2.78% of the B− genome. Moreover, the BLAST-like alignment identity calculated from the PAF file shows that most of the aligned sequences have high identity (Fig. S5). To highlight the divergences between the B+ and B− genomes, we selected the most dissimilar alignments using an identity of ≤ 0.5 or ≤ 0.8 to produce the Circos plots (Fig. 2b, c). The Circos plots highlight several genomic blocks rearranged in the B+ genome (Fig. 2b, c).
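
The identity filter used to select the most dissimilar alignments can be sketched in Python from the PAF columns. The sketch assumes the common definition of BLAST-like identity for PAF records, namely the number of residue matches (column 10) divided by the alignment block length (column 11); the function names and the example records are hypothetical and are not taken from the authors' scripts.

```python
def blast_like_identity(paf_line):
    """Residue matches (PAF column 10) over alignment block length (column 11)."""
    f = paf_line.rstrip("\n").split("\t")
    matches, block_len = int(f[9]), int(f[10])
    return matches / block_len


def most_dissimilar(paf_lines, max_identity):
    """Keep alignments whose identity is at or below max_identity."""
    return [
        line for line in paf_lines
        if line.strip() and blast_like_identity(line) <= max_identity
    ]


# Hypothetical records with identities of 0.99, 0.75, and 0.40.
example = [
    "bplus_1\t1000\t0\t1000\t+\tbminus_A\t5000\t0\t1000\t990\t1000\t60",
    "bplus_2\t1000\t0\t1000\t+\tbminus_B\t5000\t0\t1000\t750\t1000\t60",
    "bplus_3\t1000\t0\t1000\t-\tbminus_C\t5000\t0\t1000\t400\t1000\t60",
]
```

Calling `most_dissimilar(example, 0.8)` keeps the last two records, and a cutoff of 0.5 keeps only the last one, mirroring the two cutoffs used for the Circos plots.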

Fig. 1

Comparative whole genome analysis of the assembled B− and B+ A. latifasciata genomes. a Whole genome alignments between the B+ and B− genome assemblies are shown as a dotplot depicting a total of 1664 aligned blocks remaining after filtering. The B+ and B− genome assemblies are represented on the Y- and X-axis, respectively. The breaks in the colored diagonal line show syntenic discontinuities pointing towards genomic rearrangements. The small dots slightly above the diagonal line represent genomic blocks signaling duplicated regions in the B+ genome. b Examples of specific rearranged genomic regions between the two assemblies, confirming segmental duplication, insertions, deletions, and the ancient WGD event. The diagonal line shows the similar regions between the two genomes, with mean percent identity given according to the colors indicated in the insert in a

Fig. 2

Alignment analysis between B+ blocks and their counterpart in the B− genome. Graphical distribution (a) of indels size comparing B+ and B− genomes. Circos plots based on cutoff of ≤ 0.5 (b) and cutoff of ≤ 0.8 (c) of most dissimilar alignments comparing B+ and B− genomes. The outermost rings of b and c represent B+ (red) and B− (blue) sequences. Asterisks (*) in c indicate hotspots with larger segments aligned between both genomes. The blue lines connecting the red and blue outermost rings indicate genomic blocks conserved between B+ and B− genomes

The repeat annotation showed that retrotransposons were the most abundant elements (a total of 444 elements), including SINEs, L1, L2, Gypsy, and BEL/Pao, followed by DNA transposons (a total of 241 elements), including hobo-Activator and Tc1 (Fig. 3). Fisher's exact test indicated GO enrichment for functions related to development, morphogenesis, cell cycle, binding, transport, immune system, and regulation of gene expression (Fig. 4 and Dataset S2).

Fig. 3

Repeat elements composition of identified duplicated blocks in the B+ genome. The Y- and X-axis of the bar chart indicates the copy-number and type of elements respectively

Fig. 4

Enrichment GO terms for the duplicated blocks in the B+ genome. Each circle represents a specific function. Only functions previously identified (Valente et al. 2014; Huang et al. 2016; Navarro-Domínguez et al. 2017) as related to B chromosome are highlighted. The Y- and X-axis have no essential meaning and represent just the graphical space. The bubbles are colored on the basis of log10 P values. Dark red circles indicate less enriched and dark blue indicates more enriched terms

We applied high-throughput paired-end mapping (PEM) to identify structural variants (SVs) in the B+ genomic data relative to the B− genome. A total of 625 interchromosomal translocations (breakpoints) were detected in the whole B+ genome (Fig. 1a). We annotated a few of the regions with identified translocations (between the 515 and 520 Mb coordinates, Scaffold NODE_552876); most of them contained fragmented genes and non-coding RNAs (Fig. S6 and S7). Genomic regions related to the B chromosome (regions showing coverage higher than 15× in the Illumina datasets) were also subjected to the read-orientation-based SV detection method. Interestingly, we found duplications, insertions, and inversions at different sites in the B+ blocks (Fig. 5).

Fig. 5

Structural variations including duplications, deletions, inversions, and translocations in a randomly selected B block (lower line) against the reference genome (upper line). Different link colors represent different orientations of the paired-end reads mapped against the reference region, indicating specific types of SVs as follows: gray links, normal reads; light blue links, deletions; green links, insertions; dark blue links, translocations. An inversion causes the paired reads to change orientation, so that both ends map to the same strand. A segmental tandem duplication is represented by one set of distantly mapped reads and one set of inverted mapped reads

Sequence analysis and physical mapping of B sequences

Among the rearranged and duplicated blocks of A. latifasciata, we analyzed those containing genes previously identified on B chromosomes of diverse species (Table S1). The BLASTn results indicate a higher number of hits for the ihhb and 45S rRNA genes in the genome of A. latifasciata. The remaining genes were not considered in the analysis because of their partial or complete absence from the A. latifasciata genome. The B+/B− alignment shows higher coverage in the B+ than in the B− samples for both the ihhb and 45S rRNA genes (Fig. 6; Fig. S8, S9, S10). The higher read coverage of these genes in the B+ genome is evidence of duplicated copies on the B chromosome. We also screened for SNPs and indels. Few B-specific SNPs and indels were found in the ihhb gene (Fig. 7 and Fig. S9), while a high number of non-B-related SNPs were found in the 45S rRNA cluster (Fig. 6 and Fig. S10). The available RNA-seq data were analyzed for both genes to evaluate their transcriptional level, and we did not detect transcripts of the ihhb and 45S rRNA genes in B+ and B− tissues of A. latifasciata. We found only a few reads in some tissues, but there were no whole transcripts for most of the tissues in both B− and B+ samples.

Fig. 6

Sequence analysis of B related genes. a Reads coverage of ihhb and 45S rRNA gene sequences. Notice the scale bar on left to differentiate the coverage rate between 0B and 2B samples. b Number of population and B-specific SNPs are shown as red and blue respectively in the bars for both genes. c Model representation of genes to elaborate their structure and localize the positions of SNPs in regions with comparison of B+ and B− individuals

Fig. 7

FISH mapping of ihhb and 45S rRNA (5.8S rRNA, 18S rRNA, and 28S rRNA) genes. Metaphases stained with DAPI, B-probes, and merged are shown for each sequence. The ihhb FISH mapping was conducted in 2B metaphases. The B chromosomes are indicated

The FISH mapping revealed extensive hybridization of ihhb over the B chromosomes in 2B metaphases and scattered signals over the A chromosomes (Fig. 7). We FISH-mapped each of the 18S, 5.8S, and 28S rRNA transcriptional regions individually to investigate whether the complete 45S rRNA cluster has moved from the A complement to the B chromosome (Fig. 7). Positive sites of the 18S rRNA gene were observed over the pericentromeric and subtelomeric areas of the B chromosome, on pericentromeric regions of A chromosomes, and scattered over a few A chromosomes. The 5.8S rRNA probe hybridized to the pericentromeric regions of the B chromosome and to telomeric and subtelomeric regions of autosomes. The 28S rRNA gene probe produced signals on telomeric and centromeric regions of the B chromosome and also on the short arms of a few chromosomes. The amplified PCR products of the ihhb and 45S rRNA genes used to construct the FISH probes were subjected to Sanger sequencing, and the sequences were aligned against the NCBI database by BLAST. The results showed 98–99% identity, with the highest number of hits to ihhb and 45S rRNA genes.

Discussion

The A. latifasciata assemblies add a draft reference genome comparable to those of other African cichlids (based on N50, genome size, gene number, and GC content statistics) (Brawand et al. 2014) and provide new information on the genomic content of B chromosomes. One of the key aspects of B chromosome studies is understanding genetic polymorphism and its impact on the origin and evolution of B chromosomes. We examined SNP rates to evaluate the levels of selective pressure on genic sequences, applying an approach similar to that suggested by Martis et al. (2012). According to Martis et al. (2012), SNPs in B genic sequences could indicate the action of different levels of selective pressure. In this context, our results identified higher SNP frequencies in males, suggesting that they could be under lower selective pressure than females. Furthermore, the comparative analysis of genome and transcriptome datasets recorded SNPs shared among the B+ individuals, providing evidence for the accumulation of B-specific polymorphisms in B+ genomes.

Different types of SVs including complex rearrangements, duplications, inversions, large deletions, and insertions can be detected using a variety of computational approaches (Keane et al. 2014). The discovery of SVs in diverse organisms has contributed remarkably to the understanding of the evolutionary biology of chromosomes (Bickhart and Liu 2014; Keane et al. 2014). The identification of many rearrangements has also been applied to explain the mechanism of sex chromosome evolution (Rogers 2015). In this sense, we carried out a whole genome rearrangement analysis to understand the molecular mechanisms of B chromosome evolution. A significant finding of our study was the identification of genomic rearrangements and syntenies from the comparison between the B− and B+ de novo assemblies. Notably, the B+ assembly is 9,752,440 bp larger than the B− assembly, which may reflect the extra genomic content of the B chromosome and is thus useful for tracing the genomic changes that might have happened due to the additional B chromosomal content. A larger B+ assembly could also be a consequence of the different numbers of B+ and B− sequence reads used for the assemblies. Although the whole genome alignment indicated that both genomes mostly share a similar pattern of mapped blocks, as expected, there are still some discontinuities (chromosomal breaks) between the two genomes. Even though both genomes belong to the same species, we encountered a variety of genomic changes. These changes are related to B chromosome presence and point towards significant events such as duplications, deletions, and insertions occurring during the evolution of the B chromosome. Remarkably, we also found a few weak syntenies that could reflect relics of the ancient teleost WGD, a significant process in teleost evolution (Santini et al. 2009; Glasauer and Neuhauss 2014). Vertebrates are characterized by two rounds of WGD in their early evolutionary history, followed by a third round of duplication exclusive to the teleost fish lineage.

Apart from this analysis, we also discovered a large number of chromosomal breakpoints, such as translocations, extensively distributed throughout the B+ genome of A. latifasciata. Several sequence duplications detected in the B blocks indicated that these sequences originated from A chromosomes and accumulated on the B chromosome through frequent duplication events. Our findings support the previous observation (Valente et al. 2014) that most autosomes have contributed sequences to the B chromosome of A. latifasciata through gene duplication, as has been documented for the origin of supernumerary elements in diverse species (Teruel et al. 2010; Martis et al. 2012; Utsunomia et al. 2016). The discovery of a large set of SVs in the B+ genome prompted the hypothesis that the presence of the B chromosome could offer additional genetic material for evolution and thus directly contribute to evolutionary processes in populations. To trace these duplication events, we analyzed the duplicated blocks in the B+ genome, which revealed different sets of genes and repeats. The annotated repeats of the duplicated blocks included all the retrotransposons (Gypsy, Bel/Pao, and L2) recently discovered on the B chromosome of A. latifasciata through a combination of bioinformatics, qPCR, and FISH mapping (Coan and Martins 2018). In addition, the higher amount of retrotransposons and DNA transposons in these B chromosome blocks suggests that these elements were major players in the duplication of B-located genes. B chromosomes are rich in several classes of repetitive DNAs, including mobile elements and sequences derived from rDNA, satellite DNA, histone genes, and small nuclear RNA genes (Friebe et al. 1995; Teruel et al. 2010; Bueno et al. 2013; Ruiz-Estévez et al. 2014; Silva et al. 2014; Marques et al. 2018). Furthermore, cytogenomic studies have also identified repetitive DNAs derived from regular genes and organelle sequences that have expanded on the B chromosome (Martis et al. 2012; Valente et al. 2014).

Among all genes described in B chromosomes of vertebrates (Table S1), two were identified in the duplicated B chromosome blocks: the ihhb and 45S rRNA genes. Our study focused on uncovering B-specific SNPs and indels in both genes, since they were found to be enriched in the B chromosome. Ihhb was also detected in the genome of the cichlid Lithochromis rubripinnis (Yoshida et al. 2011), with more than 40 copies of ihhb paralogs on the B chromosomes and a single copy of the ihhb ortholog on the A chromosomes. Our results suggest that the ihhb gene remains highly conserved among different vertebrate species, as no population-wise polymorphism was detected. The higher number of B-specific SNPs and indels found in the ihhb gene, together with the absence of transcripts, indicates that the copies on Bs have become pseudogenes. Several papers have emphasized the evolutionary importance of the hedgehog gene family and outlined the duplication events involving members of this family, including ihhb, in many vertebrates (Holland 1992; Carroll 1995; Ekker et al. 1995; Kumar et al. 1996; Zardoya et al. 1996). We suggest that the increased level of polymorphism observed in B-located copies of the ihhb gene is an interesting phenomenon for elucidating the mechanism of gene duplication and neofunctionalization and for understanding the molecular evolution of the B chromosome. During duplication, modifications such as insertions, deletions, and mutations might have occurred in the gene sequence, as in the case of ihhb; therefore, we term ihhb a "non-processed or duplicated pseudogene." More importantly, the ihhb gene is involved in several functions related to vertebrate morphogenesis and regulates the PTCH2 genes, which have been reported to be expressed in testis tissues (Carpenter et al. 1998) and are thus involved in sexual development. Although the abundant ihhb copies in the B chromosome of A. latifasciata seem to be inactive, we cannot rule out that the ihhb copies expanded in the B chromosomes of other cichlids could have acquired a function. Recent studies on B chromosomes of cichlids have outlined their role in sex-related functions, mainly sex determination (Yoshida et al. 2011; Clark et al. 2017). Our genomic analyses of this gene in A. latifasciata, followed by FISH mapping, document the novelty of its organization and expansion on the B chromosome and raise questions about its role in B evolution and function.

Previous descriptions of ribosomal genes on the B chromosome of A. latifasciata (Poletto et al. 2010) give rise to the hypothesis that the complete 45S ribosomal DNA (rDNA) cluster may have started amplifying in the A complement and moved to the B chromosome during the initial stages of evolution of the proto-B chromosome. From the sequencing data analysis, we found that some regions of the cluster indeed showed higher coverage in B+ samples than in B− samples; however, no distinct differences were found in the overall coverage of the 45S cluster. Our FISH mapping results for 18S rRNA and 28S rRNA confirmed that extra rRNA gene copies have accumulated on the B chromosome. The weaker chromosomal signal of the 5.8S rRNA segment seems to be related to its small size compared with the 18S and 28S rRNA transcribing segments. The 5.8S, 18S, and 28S rRNA signals on the B chromosome support our hypothesis of the duplication and expansion of the whole rRNA cluster from the A to the B chromosome. We did not find transcripts of the 45S rRNA gene cistron in A. latifasciata. Likewise, a previous analysis of nucleolus organizer region (NOR) activity did not detect transcriptional evidence for the rRNA copies on the B chromosome (Poletto et al. 2010).
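
The comparison of B+ and B− coverage along a region such as the 45S cluster can be sketched as a per-window log2 ratio. The Python example below is a minimal illustration that assumes normalization by total read count and uses a pseudocount; the function name, the normalization choice, and the depth values are hypothetical and do not reproduce the analysis performed here.

```python
import math
from typing import List, Sequence


def log2_coverage_ratio(
    b_plus_depth: Sequence[float],
    b_minus_depth: Sequence[float],
    b_plus_total_reads: int,
    b_minus_total_reads: int,
    pseudocount: float = 1.0,
) -> List[float]:
    """Per-window log2(B+ / B-) depth ratio after library-size scaling.

    Assumption: differences in sequencing effort are corrected by scaling
    the B+ depth to the B- library size. The pseudocount avoids division
    by zero in windows without reads.
    """
    if len(b_plus_depth) != len(b_minus_depth):
        raise ValueError("both samples need the same number of windows")
    scale = b_minus_total_reads / b_plus_total_reads
    return [
        math.log2((plus * scale + pseudocount) / (minus + pseudocount))
        for plus, minus in zip(b_plus_depth, b_minus_depth)
    ]


# Invented depths for four windows; the B+ library is twice as large.
ratios = log2_coverage_ratio(
    b_plus_depth=[60, 62, 180, 58],
    b_minus_depth=[30, 31, 30, 29],
    b_plus_total_reads=2_000_000,
    b_minus_total_reads=1_000_000,
)
print([round(r, 2) for r in ratios])
```

Windows with ratios near zero have similar copy number in both samples, whereas clearly positive values point to extra copies in the B+ sample. Scaling by library size matters because unequal read numbers alone can produce apparent coverage differences.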

In addition to repeated DNAs, recent studies have added an extensive list of genes or gene fragments present in B chromosomes (Ahmad and Martins 2019; Houben et al. 2019). A few genes on B chromosomes were identified over the last decades through classical genetics studies (Dherawattana and Sadanaga 1973; Miao et al. 1991a, b; Graphodatsky et al. 2005). Bioinformatics and genomics tools developed in the last decade have revealed a higher number of B-located genes and functional sequences and started a new debate about the evolutionary role of B chromosomes, their complex interactions with the host genome, and their possible effects, ranging from sex determination to development and adaptation. These studies have been conducted extensively in diverse organisms such as fungi (Coleman et al. 2009; Goodwin et al. 2011; Bertazzoni et al. 2018), plants (Banaei-Moghaddam et al. 2013; Ma et al. 2017), insects (Akbari et al. 2013; Navarro-Domínguez et al. 2017, 2019), and vertebrates (Trifonov et al. 2013; Valente et al. 2014; Clark et al. 2018; Makunin et al. 2018) and have highlighted a long list of genes present in B chromosomes. These analyses identified several putative B genes of high integrity related to functions/structures such as pathogenicity (exclusive to fungi), cell cycle, chromosome organization, cytoskeleton, development, and neural system. Several of these studies agree strongly in identifying cell cycle genes in B chromosomes. The gene annotation of the A. latifasciata B+ blocks showed significant GO enrichment of functions related to diverse biological processes, including cell cycle. Based on the large-scale data accumulated for A. latifasciata (Valente et al. 2014, 2017; present study), it seems plausible that B chromosomes can modulate cell physiology in a very complex way, including the control of the cell-cycle regulatory mechanisms involved in B drive. The B is in a constant co-evolutionary battle with the A genome, and drive seems to be the first step towards B stability and survival. Furthermore, we cannot rule out that the B chromosome offers an independent genome compartment for genome adaptation and innovation.
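
The GO enrichment of the B+ blocks mentioned above can be illustrated with a one-sided hypergeometric test, a common choice for this kind of analysis. The Python sketch below assumes this test; the statistical method, the function name, and the gene counts are illustrative assumptions, not the procedure or results of this study.

```python
from math import comb


def hypergeom_enrichment_p(
    population: int, annotated: int, sample: int, hits: int
) -> float:
    """P(X >= hits) for a hypergeometric draw without replacement.

    population: genes in the annotated genome (background)
    annotated:  background genes carrying the GO term
    sample:     genes found in the blocks of interest
    hits:       sampled genes carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(max(hits, 0), upper + 1)
    )
    return favourable / total


# Invented counts: 20,000 background genes, 400 with a "cell cycle" term,
# 150 genes in the blocks of interest, 12 of them carrying the term.
p_value = hypergeom_enrichment_p(20_000, 400, 150, 12)
print(f"{p_value:.3g}")
```

Under the null expectation, only about three of the 150 sampled genes would carry the term, so twelve hits yield a small p-value. When many GO terms are tested, the p-values would additionally need a multiple-testing correction.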

Conclusion

Our analysis makes several contributions: (1) the generation of a draft genome for A. latifasciata, useful for future analyses in evolutionary and applied genomics; (2) the screening of high coverage sequencing data from different individuals for polymorphisms, gene diversity, and B-specific structural variations, in which the nucleotide polymorphisms identified in B sequences are useful for tracking the evolutionary history and functional aspects of Bs; (3) the discovery of B chromosome linked genes/sequences; (4) evidence that duplication events generated a higher level of structural variation associated with the B chromosome and a higher number of copies of gene/sequence variants in the B chromosome; and (5) data that prompt the hypothesis that the presence of supernumerary chromosomes adds new evolutionary genomic components to cells and organisms.

Assembly and data files

Genomic and transcriptomic datasets are available at the Sacibase Database (sacibase.ibb.unesp.br). Sequencing data of the PCR products of the 28S rRNA, 18S rRNA, 5.8S rRNA, and ihhb gene sequences are available in the NCBI GenBank database under accession numbers MK182936, MK182937, MK185008, and grp6845354.