Introduction

Sweet sorghum (Sorghum bicolor (L.) Moench) is an ideal crop for future food, feed, and fuel security because of its high cellular biomass, stalk sugar content, and average grain yield. The crop is further characterized by high photosynthetic, water, and nutrient use efficiency, which help it withstand changing environmental conditions (Rao et al. 2019). It is therefore a potential alternative to fossil fuels for meeting future bioenergy needs. Current breeding schemes need an overhaul to adapt to changing environmental conditions. Instead of relying solely on recombination within one population, breeders should explore variability across sorghum species and sub-types. The diverse primary races and intermediate varieties of sorghum offer substantial natural variation (Venkateswaran et al. 2018). However, intensive selection and breeding have led to a loss of genetic diversity in modern germplasm, especially in specific agroecological zones (Smith et al. 2019).

Previous studies suggest that natural selection played a major role in the evolution of sweet sorghum. Midrib color and stem juiciness are controlled by non-functional alleles of genes related to secondary cell wall development (Zhang et al. 2018). Interestingly, midrib color strongly correlates with traits such as sugar yield, juice volume, and moisture content (Burks et al. 2015). Sweet and grain sorghum differ markedly at the phenotypic level and are thought to carry genome-level changes arising from the mutation load accumulated during crop domestication (Jiang et al. 2013; Smith et al. 2019). The primary gene pool of sweet sorghum comprises 35,467 genes along with several structural variations (SVs). Previous studies reported that deletions were more frequent than duplications in sweet sorghum, marking an important aspect of sweet sorghum evolution (Cooper et al. 2019). Comparative genomics of sweet and grain sorghum genomes identified distinct genes associated with variations that help distinguish these sub-types at the genome level (Zheng et al. 2011).

Most genome-wide studies emphasize the number of genes annotated in the reference genome; however, part of the genome is shared by only a subset of individuals within the species and is termed the dispensable genome (Yao et al. 2015). In addition, dispensable genes are associated with complex genomic regions affected by SVs and are likely to be missed during reference genome assembly (Gerdol et al. 2020). Comparative studies that use reference genomes from a single organism or sub-type are unreliable because genomes vary considerably within a species (Bhatti et al. 2020). The third revolution in sequencing technologies has decreased the cost of sequencing (Jiao and Schneeberger 2017), allowing researchers to sequence more individuals and capture a major portion of the dispensable genes in a population. RNA-seq offers a cost-effective alternative to genome sequencing for identifying functional genes and regulatory elements in plants with complex genomic architecture (Jin et al. 2016).

Representative Transcript Assemblies (RTAs) group the clustered sequences of genes into a representative superstructure and provide an effective way to characterize core, dispensable, and private genes in a population using RNA-seq datasets (Hirsch et al. 2014). In the present investigation, we constructed RTAs, or superTranscripts; these superTranscripts facilitated population-level identification of genes and non-coding RNAs and helped improve existing sorghum genome annotations. The superTranscript-based gene presence/absence across 15 sorghum genomes provided a better understanding of the evolution of domesticated sorghum.

Material and Methods

Sorghum Genomes/Transcriptomes and Pre-Processing

The Sweet Sorghum Reference Genome version 2.1 (SSRG), along with 13 cultivated sorghum genomes belonging to diverse racial types, was retrieved from Phytozome [https://phytozome-next.jgi.doe.gov] and SorghumBase [https://www.sorghumbase.org], as reported in past studies (McCormick et al. 2018; Varoquaux et al. 2019; Cooper et al. 2019; Voelker et al. 2023) (Supplemental Table S1). These include 9 parent genotypes from the Carbon-Partitioning Nested Association Mapping (CP-NAM) population representing diverse bioenergy sorghum sub-types, 4 cultivated sorghum reference genomes (Boatwright et al. 2021), and one elite sweet sorghum cultivar, Wray (Broadhead et al. 1981). To compare the number of annotated genes and PAVs between cultivated sorghum (Sorghum bicolor subsp. bicolor) and its wild progenitor, one wild sorghum genome (Sorghum bicolor subsp. verticilliflorum, accession 353) was included from the sorghum pan-genome study (Tao et al. 2021). The pan-genome representing 390 diverse bioenergy sorghum accessions was downloaded from the ICRISAT repository [http://dataverse.icrisat.org] (Ruperao et al. 2021). The 223 sweet sorghum RNA-seq accessions from past studies, covering seven diverse sweet sorghum genotypes, namely Rio (Li et al. 2019b; Cooper et al. 2019), Keller, SIL05 (Mizuno et al. 2016), Della, Dochna (Zhou et al. 2022), Roma (Sui et al. 2015), and M-81E (Sui et al. 2015), were retrieved from the NCBI Sequence Read Archive (SRA) database using the IBM Aspera Connect data transfer protocol. Reads were quality trimmed with FastP (Chen et al. 2018) to remove poor-quality reads and adapter contamination before assembly (Supplemental Data S1).

superTranscriptome Construction

We used the Necklace (Davidson and Oshlack 2018) pipeline, with a few improvements, to build the superTranscriptome. The improvements consist of running the genome-guided assembly, de novo transcriptome assembly, clustering, superTranscript construction, and gene expression count steps individually, so that an error in one step of a large dataset does not cost the time spent on the others (Fig. 1a; Supplemental Figure S1). The RNA-seq reads from sweet sorghum were aligned to SSRG using HISAT2 (Kim et al. 2019) with default parameters and known splice site information. Sequence Alignment Map (SAM) to Binary Alignment Map (BAM) conversion and BAM sorting were done using SAMtools (Li et al. 2009). Genome-guided transcript assemblies for individual samples were obtained using StringTie (Pertea et al. 2015) with default parameters and merged into a single *.gtf file with the stringtie --merge option. The merged *.gtf was then flattened with gtf2flatgtff.pl to extend the gene boundaries and obtain longer genome-guided transcripts. Finally, genome-guided transcript sequences were obtained with GffRead (Pertea and Pertea 2020). The RNA-seq reads were also assembled into de novo transcripts using Trinity (Grabherr et al. 2011) with --max_memory 100G. Protein-coding ORFs for de novo transcripts were obtained against grain sorghum (BTX623, BTX642, and RTX430) proteomes using BLAT (Kent 2002) with parameters -t=dnax -q=dnax -minScore=200 and the chimera_breaker tool bundled with the Necklace pipeline. Genome-guided and de novo transcripts were clustered into groups using BLAT with parameters -minScore=200 -minIdentity=90. Finally, a superTranscript representing each cluster/group was constructed using Lace (Davidson et al. 2017) with cluster and sequence information. The entire workflow for superTranscriptome construction was divided into three steps and automated with Bpipe (Sadedin et al. 2012).
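The genome-guided part of this workflow can be sketched as a small command builder. This is an illustrative sketch only, not the authors' Bpipe configuration: the sample names, index name, splice-site file, output directory, and the assumption of paired-end FASTQ files are all hypothetical, and the commands are built as argument lists rather than executed.

```python
from pathlib import Path


def genome_guided_commands(sample, index, splice_sites, out_dir="assembly"):
    """Build (but do not run) the align, sort, and assemble commands for one sample."""
    out = Path(out_dir)
    sam = str(out / f"{sample}.sam")
    bam = str(out / f"{sample}.sorted.bam")
    gtf = str(out / f"{sample}.gtf")
    return [
        # HISAT2 alignment with known splice sites; paired-end FASTQ names are hypothetical.
        ["hisat2", "-x", index, "--known-splicesite-infile", splice_sites,
         "-1", f"{sample}_R1.fastq.gz", "-2", f"{sample}_R2.fastq.gz", "-S", sam],
        # SAMtools converts the SAM file to a coordinate-sorted BAM in one step.
        ["samtools", "sort", "-o", bam, sam],
        # StringTie assembles transcripts for this sample with default parameters.
        ["stringtie", bam, "-o", gtf],
    ]


def merge_command(sample_gtfs, merged="merged.gtf"):
    """Build the StringTie command that merges per-sample assemblies into one GTF."""
    return ["stringtie", "--merge", "-o", merged, *sample_gtfs]


if __name__ == "__main__":
    for cmd in genome_guided_commands("Rio_rep1", "ssrg_index", "ssrg_splice_sites.txt"):
        print(" ".join(cmd))
    print(" ".join(merge_command(["assembly/Rio_rep1.gtf", "assembly/Keller_rep1.gtf"])))
```

Keeping each step as a separate command mirrors the step-wise execution described above, where a failure in one stage does not force the earlier stages to be repeated.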

Fig. 1

a The Necklace pipeline (Davidson and Oshlack 2018) used in the present analysis integrates reference-guided and de novo transcriptome assembly to obtain complete protein-coding orthologs. b The sweet sorghum superTranscriptome constructed with the Necklace pipeline contains three types of sequences: (1) annotated on the sweet sorghum reference genome, (2) annotated on grain sorghum, and (3) unannotated

superTranscriptome Quality

The superTranscriptome assembly quality was analyzed with TransRate (Smith-Unna et al. 2016) using sequence information to obtain N50 values. Additionally, superTranscriptome constructed with 98%, 90%, 80%, and 70% sequence identities were checked with TransRate to see whether clustering influences contiguity. The superTranscriptome and six sorghum reference genomes were further searched for complete, partial, and missing gene orthologs within the Poaceae family with the Poales database using BUSCO (Simão et al. 2015). Further, RNA-seq samples from Rio were aligned to superTranscriptome and six published sorghum genomes using HISAT2 (Kim et al. 2019) with default parameters to obtain percentage read coverage by each assembly.
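For reference, the N50 statistic used as the contiguity measure can be computed as follows. This is a minimal sketch of the standard definition, not TransRate's own code, and the toy lengths are invented for illustration.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one sequence length")


if __name__ == "__main__":
    toy_lengths = [100, 200, 300, 400, 500]  # hypothetical superTranscript lengths (bp)
    # Total is 1500 bp; 500 + 400 = 900 bp is the first running sum to reach 750 bp.
    print(n50(toy_lengths))  # 400
```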

Functional Annotation and Characterization of Unannotated Genes

The genome-guided assembly reported 13,815 new genomic regions on SSRG, which were first annotated against Swiss-Prot (Boutet et al. 2016) for protein-coding genes using the Trinotate (Ghaffari et al. 2014) pipeline. The superTranscripts without Swiss-Prot annotations were searched for homology against RNAcentral (Sweeney et al. 2020) using BLASTN with parameters pid = 90 and qcov = 90 to report genes encoding miRNAs and snoRNAs involved in alternative splicing and silencing. The superTranscripts that remained unannotated after the above approaches were evaluated for coding potential using CPC2 (Kang et al. 2017) and divided into putative protein-coding and non-coding categories. Finally, lncRNAs among the putative non-coding transcripts were annotated against the PLncDB database using BLASTN (parameters pid = 90 and qcov = 90).
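The identity and coverage thresholds (pid = 90, qcov = 90) can be applied to tabular BLASTN output as sketched below. The four-column layout (query, subject, percent identity, query coverage) is an assumption made for this example, as are the identifiers in the toy data; the actual output format used in the study is not stated.

```python
def filter_hits(lines, min_pid=90.0, min_qcov=90.0):
    """Keep, per query, the highest-identity hit that passes both thresholds.

    Each line is assumed to be tab-separated: query, subject, percent identity, query coverage.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        query, subject, pid, qcov = line.rstrip("\n").split("\t")[:4]
        pid, qcov = float(pid), float(qcov)
        if pid >= min_pid and qcov >= min_qcov:
            if query not in best or pid > best[query][1]:
                best[query] = (subject, pid, qcov)
    return best


if __name__ == "__main__":
    toy = [
        "ST_0001\tncRNA_A\t95.2\t98",   # passes both thresholds
        "ST_0001\tncRNA_B\t99.1\t92",   # passes, and has higher identity
        "ST_0002\tncRNA_C\t88.0\t100",  # fails identity
        "ST_0003\tncRNA_D\t97.5\t60",   # fails coverage
    ]
    print(filter_hits(toy))  # only ST_0001 is retained, matched to ncRNA_B
```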

superTranscript-Based Genome Annotation and Gene Densities

The superTranscripts were used as cDNA/ESTs to re-annotate 15 diverse sorghum genomes using the PASA gene structure annotation pipeline (Haas et al. 2008) through the alignAssembly method with the blat and gmap aligners. To report novel genes, alternative splice sites, and gene structure updates, the superTranscript-based annotations of 14 sorghum genomes were compared with the existing annotations using the PASA annotCompare method. Alternative splicing updates across the 15 sorghum genomes were also reported with PASA using the superTranscriptome. This identified transcripts with retained introns, spliced exons, and splice donor and acceptor sites. The genome-wide annotations, gene structure, and splicing updates were visualized as bar diagrams with the ggplot2 R package. The numbers of genes per chromosome for SSRG and for superTranscriptome-annotated SSRG were obtained, and a bar diagram of chromosome-wise gene counts was plotted with a custom R script. The *.gff3 annotation file produced by annotating the superTranscripts on SSRG with the PASA pipeline (Haas et al. 2008) was used to obtain gene densities for a tile size of 100 kb with the feature exon using the GFFex function of the RIdeogram (Hao et al. 2020) R package. Similarly, gene densities for SSRG were obtained from the existing gene annotation *.gff3 file from Phytozome. Finally, an ideogram highlighting transcribed regions across the ten sorghum chromosomes was prepared with the overlay method.

Orthogroups Identification, Synteny Analysis, and Identification of De Novo Genes

Protein sequences for the superTranscript-based gene models across 15 genomes and for the singleton genes were predicted using TransDecoder [https://github.com/TransDecoder/TransDecoder] with a minimum sequence length of 100. This gave several protein sequences per gene; therefore, the longest protein sequence per gene across all cultivars was selected for orthogroup identification. Orthogroups were identified using OrthoMCL (Li et al. 2003) with BLASTP e-value < 10−5 and inflation factor = 1.5, as previously described for the sorghum pan-genome (Tao et al. 2021). Because gene duplication events are major contributors to speciation, we used OrthoFinder (Emms and Kelly 2019) to report duplicated orthogroups across assemblies. Syntenic gene families across assemblies were identified using the MCScanX_h algorithm of MCScanX (Wang et al. 2012) with BLASTN e-value < 10−5 and -max_target_seqs = 5 for the homology search. To report de novo genes in sweet sorghum, we looked for syntenic matches with other organisms as described previously for rice (Zhang et al. 2019). Briefly, synteny matches of sweet sorghum with grain sorghum (McCormick et al. 2018), wild sorghum (353) (Tao et al. 2021), several other C3-C4 grasses (Saccharum spontaneum, Panicum hallii, Setaria viridis, Zea mays, Setaria italica, Aegilops tauschii, Eragrostis curvula, Triticum aestivum, Oryza sativa, Brachypodium distachyon, Avena sativa, and Hordeum vulgare), and the model plant species Arabidopsis thaliana from the Gramene database [https://www.gramene.org/] (Gupta et al. 2016) were obtained using MCScanX. Genes syntenic with a given organism were marked as 1 and nonsyntenic genes as 0. This provided a binary matrix (0/1) that was used for pairwise distance calculations and phylogeny construction with Unweighted Pair-Group Method with Arithmetic mean (UPGMA) clustering using Phylip (Mansour 2009). Genes with no synteny match in any of the above organisms were noted as de novo genes. These de novo genes were further searched for ORFs using TransDecoder with the default settings. Transcriptional evidence for de novo genes was obtained from DGE-assigned p-values using RNA-seq. The de novo genes were further investigated to determine whether they originated from transposases, based on matches to internal sequences of MULE, TNT, and RE transposases using BLASTN (e-value < 10−37), as described previously for rice (Juretic et al. 2005).
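The use of the binary synteny matrix can be illustrated with a short sketch. The distance measure shown here (the fraction of genes whose synteny state differs between two organisms) is an assumption, since the text does not state which distance was supplied to Phylip, and the gene and organism names are invented toy data. The resulting distances would then be passed to a UPGMA implementation.

```python
def de_novo_genes(matrix):
    """Genes with no syntenic match (all zeros) in any comparison organism."""
    return [gene for gene, row in matrix.items() if not any(row.values())]


def pairwise_distances(matrix, organisms):
    """Fraction of genes whose 0/1 synteny state differs between each pair of organisms."""
    n_genes = len(matrix)
    distances = {}
    for i, first in enumerate(organisms):
        for second in organisms[i + 1:]:
            mismatches = sum(row[first] != row[second] for row in matrix.values())
            distances[(first, second)] = mismatches / n_genes
    return distances


if __name__ == "__main__":
    # Hypothetical matrix: rows are sweet sorghum genes, columns are comparison organisms.
    toy = {
        "gene1": {"grain": 1, "wild": 1, "rice": 0},
        "gene2": {"grain": 1, "wild": 0, "rice": 0},
        "gene3": {"grain": 0, "wild": 0, "rice": 0},
        "gene4": {"grain": 1, "wild": 1, "rice": 1},
    }
    print(de_novo_genes(toy))  # ['gene3']
    print(pairwise_distances(toy, ["grain", "wild", "rice"]))
```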

Identification of Core, Dispensable, and Private Genes in the Population

superTranscript locations were annotated on 15 sorghum genomes using the PASA pipeline (F. et al. 2006; Haas et al. 2008). A superTranscript that PASA placed on a given genome was marked as 1 for that genome, and otherwise as 0. The core, dispensable, and cloud genes were characterized based on their presence/absence across genomes, following an approach similar to one described previously (Jobson and Roberts 2022). Briefly, superTranscripts were categorized as core if they shared locations across all 15 genomes, dispensable if they shared locations across 4–14 genomes, and cloud if they shared locations across 1–3 genomes. Private superTranscripts were those that were exclusive to a single genome. The proportion of core, dispensable, and cloud genes per genome was obtained and visualized as a bar diagram, as in the past investigation on the sorghum pan-genome (Tao et al. 2021). The superTranscripts that did not share any locations on the above-mentioned 15 sorghum genomes were marked as orphan genes because they lacked homologs or had only partial homologs on those genomes (Yao et al. 2017).
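The classification rule can be written out as a short sketch. The thresholds follow the text (15 genomes: core, 4–14: dispensable, 1–3: cloud, 0: orphan); treating "private" as a flag on cloud superTranscripts found in exactly one genome is an interpretation made for this example, and the function name is hypothetical.

```python
def classify_supertranscript(presence, n_genomes=15):
    """Classify one superTranscript from its 0/1 presence vector across genomes.

    Returns (category, is_private); is_private marks presence in exactly one genome.
    """
    if len(presence) != n_genomes:
        raise ValueError(f"expected {n_genomes} presence values, got {len(presence)}")
    count = sum(presence)
    if count == 0:
        return "orphan", False
    if count == n_genomes:
        return "core", False
    if count >= 4:
        return "dispensable", False
    return "cloud", count == 1


if __name__ == "__main__":
    print(classify_supertranscript([1] * 15))            # ('core', False)
    print(classify_supertranscript([1] * 9 + [0] * 6))   # ('dispensable', False)
    print(classify_supertranscript([1] * 2 + [0] * 13))  # ('cloud', False)
    print(classify_supertranscript([1] + [0] * 14))      # ('cloud', True)
    print(classify_supertranscript([0] * 15))            # ('orphan', False)
```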

Origin of Sweet Sorghum-Specific Genes

We assigned ages to the genes exclusive to the sweet sorghum genomes, namely Rio, Chinese amber, Leoti, and Wray. Pairwise sequence alignments of these genes with cDNA sequences of grain sorghum (McCormick et al. 2018), wild sorghum (353) (Tao et al. 2021), several other C3-C4 plant outgroups, and Arabidopsis were performed using blat (pid = 60 and minscore = 100). Genes were then classified by the target species they hit: those with hits to Arabidopsis sequences were categorized as core angiosperm or Mesangiospermae; those with hits to Hordeum, Oryza, or any other C3 grass were classified as of Poaceae origin; those with hits to Saccharum, Panicum, or any other C4 grass were classified as of Andropogoneae origin; those aligning with grain or wild sorghum were categorized as sorghum-specific or of Sorghinae origin; and genes with no hits were designated as sweet sorghum-specific. This gave a hit matrix (0/1) for all organisms, which was further used for pairwise distance calculations and UPGMA phylogeny construction using Phylip (Mansour 2009).
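The age assignment can be sketched as a lookup over ordered strata. The text does not say how a gene with hits in several groups was resolved; this sketch assumes the oldest stratum with a hit wins, which is the usual phylostratigraphic convention. The species sets are abbreviated and the labels are illustrative.

```python
# Strata ordered from oldest to youngest; species lists are abbreviated for illustration.
STRATA = [
    ("Mesangiospermae", {"Arabidopsis"}),
    ("Poaceae", {"Hordeum", "Oryza"}),
    ("Andropogoneae", {"Saccharum", "Panicum"}),
    ("Sorghinae", {"grain_sorghum", "wild_sorghum"}),
]


def assign_age(hit_species):
    """Return the oldest stratum containing a hit (assumed rule), else sweet sorghum-specific."""
    hits = set(hit_species)
    for stratum, members in STRATA:
        if hits & members:
            return stratum
    return "sweet sorghum-specific"


if __name__ == "__main__":
    print(assign_age(["grain_sorghum", "Oryza"]))  # Poaceae
    print(assign_age(["wild_sorghum"]))            # Sorghinae
    print(assign_age([]))                          # sweet sorghum-specific
```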

Differential Exon Usage (DEU) Analysis

The DEU comparison followed the methodology described by Davidson et al. (2017). The RNA-seq reads of sweet sorghum Rio from past studies (Cooper et al. 2019) were aligned to SSRG and to the superTranscriptome using HISAT2. Novel splice sites were extracted for the superTranscriptome to define blocks within the superTranscripts. The *.gtf file for superTranscript blocks was prepared with the make_block tool in the Necklace pipeline. Exon-bin or block-wise expression counts were obtained using FeatureCounts (Liao et al. 2014). Statistical significance for exon usage was tested with the DEXSeq (Anders et al. 2012) R package. Per-gene q-values for differential exon usage were obtained for both approaches. Once per-gene q-values were obtained, the true positives and true negatives were selected based on q-value cutoffs: superTranscripts with a q-value below 0.05 were considered true positives, and superTranscripts with a q-value of 0.9 and above were considered true negatives. The scikit-learn Python module was used to train on the datasets with a logistic regression method. This made it possible to report the differences between actual and predicted labels for the two approaches. The ROC curve showing DEU performance was plotted with a custom R script. Similarly, the KNeighborsClassifier method of scikit-learn was used to train on the datasets and prepare the confusion matrix for both approaches. A DEU analysis of Rio and PR22 internodes was also done to report differentially spliced transcripts between these two genotypes during internode growth.
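The q-value labelling step can be sketched in plain Python. The cutoffs follow the text; scoring the second approach as 1 − q and summarizing the ROC curve by its area are assumptions made for this example and do not reproduce the scikit-learn models used in the study. Gene names and q-values are invented.

```python
def label_by_qvalue(qvalues, positive_cutoff=0.05, negative_cutoff=0.9):
    """Label genes 1 (q < 0.05) or 0 (q >= 0.9); genes in between are left unlabelled."""
    labels = {}
    for gene, q in qvalues.items():
        if q < positive_cutoff:
            labels[gene] = 1
        elif q >= negative_cutoff:
            labels[gene] = 0
    return labels


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [scores[g] for g, y in labels.items() if y == 1 and g in scores]
    negatives = [scores[g] for g, y in labels.items() if y == 0 and g in scores]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical per-gene q-values from the two references.
    q_reference = {"g1": 0.01, "g2": 0.03, "g3": 0.50, "g4": 0.95, "g5": 0.99}
    q_supertranscript = {"g1": 0.02, "g2": 0.97, "g3": 0.40, "g4": 0.92, "g5": 0.99}
    labels = label_by_qvalue(q_reference)  # g3 is dropped as ambiguous
    scores = {gene: 1 - q for gene, q in q_supertranscript.items()}
    print(labels)
    print(roc_auc(labels, scores))  # 0.75
```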

Differential Gene Expression Analysis

To report changes in differential gene expression between two references, i.e., SSRG and the superTranscriptome, the RNA-seq reads from leaf, meristem, and internode tissues at six time points of sweet sorghum Rio were aligned to both, followed by expression counting using FeatureCounts (Liao et al. 2014) with a transcript count method. Transcript counts were then normalized across time points, and log2 fold changes were estimated with the Likelihood Ratio Test (LRT) method using the DESeq2 (Love et al. 2014) R package. The volcano plots representing DGE for SSRG and the superTranscriptome were plotted using the EnhancedVolcano (Blighe K et al. 2022) R package. The top 50 highly expressed genes with a P value < 0.05 from the two methods were visualized as a heat map with the geom_tile function of the ggplot2 (Gómez-Rubio 2017) R package.

Similarly, differential gene expression analysis of Rio and PR22 during internode growth was done using superTranscriptome as a reference. The top 50 highly expressed genes between these two genotypes during internode growth were visualized as a heatmap. The gene ontology enrichment analysis was performed to report significantly enriched GO terms with AnnotationForge (Pagès et al. 2022) based custom database using superTranscriptome annotations. Significantly enriched GO terms (P value < 0.05) between Rio and PR22 during internode growth were reported using the clusterProfiler (Wu et al. 2021) R package.

Identification of Agronomically Essential Genes and Phylogenetic Analysis

The agronomically essential genes involved in DNA packaging, transcription, sugar transport, sugar metabolism, and stress responses were selected for multiple sequence alignment (MSA). These include genes encoding histones, cytochrome b5, hexokinases, zinc finger domains, MULE-like transposase, NAC, MYB, SWEET and SUT transporters, Invertase (INV), expansin, Universal Stress Protein (USP), etc. The phylogenetic analysis was performed to report sequence-level changes across these genes in sweet sorghum in comparison with grain sorghum, other C4-C3 outgroups, and Arabidopsis. The shared ortholog sequences for these genes from grain sorghum, Saccharum sp., maize, barley, and Arabidopsis were obtained from Phytozome [https://phytozome-next.jgi.doe.gov/]. The multiple sequence alignment of shared orthologs was done using MUSCLE (Edgar 2004). To refine the alignment, trimming was performed using trimAl (Capella-Gutiérrez et al. 2009) with parameters (-gt 0.5 -resoverlap 0.5 -seqoverlap 50). Finally, a maximum-likelihood phylogenetic analysis was performed using IQ-TREE (Nguyen et al. 2015) with standard substitution model selection (-m TEST) and 1,000 bootstrap replicates. The phylogenetic tree for these genes was visualized and annotated using iTOL (Letunic and Bork 2021).

Genomic Landscape of Sweet Sorghum Evolution

Using ScanPAV (Giordano et al. 2018), we compared the genomes of grain (BTX623) and wild sorghum (353) with SSRG to identify large deletions (> 1 kb) of evolutionary significance. ScanPAV uses a sliding window (SW) approach to align sequences to the reference with BWA (Li and Durbin 2010), reports blocks longer than 1 kb that are missing in the reference, and writes the missing blocks and their genomic coordinates in *.fasta and *.vcf file formats. To characterize genic and intergenic deletions in grain and wild sorghum, blocks missing relative to SSRG were checked for overlap with protein-coding genes using BEDTools (Quinlan and Hall 2010). Similarly, regions missing in the cultivated sorghum genomes, i.e., Rio and BTX623, compared to wild sorghum were identified. This reported several large deletions that occurred in cultivated sorghum relative to wild sorghum during sorghum domestication.
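The genic versus intergenic classification of missing blocks amounts to an interval-overlap test, sketched below. This does not reproduce the BEDTools invocation used in the study; it assumes BED-style half-open coordinates, and the chromosome names, coordinates, and gene identifiers are invented.

```python
def classify_deletions(deletions, genes, min_length=1000):
    """Split deletions longer than min_length into genic and intergenic sets.

    deletions: (chrom, start, end) tuples; genes: (chrom, start, end, gene_id) tuples.
    Coordinates are assumed to be 0-based and half-open, as in BED files.
    """
    result = {"genic": [], "intergenic": []}
    for chrom, start, end in deletions:
        if end - start <= min_length:
            continue  # keep only blocks longer than 1 kb
        overlaps_gene = any(
            g_chrom == chrom and g_start < end and start < g_end
            for g_chrom, g_start, g_end, _ in genes
        )
        result["genic" if overlaps_gene else "intergenic"].append((chrom, start, end))
    return result


if __name__ == "__main__":
    toy_genes = [("Chr01", 5_000, 8_000, "geneA"), ("Chr02", 20_000, 22_000, "geneB")]
    toy_deletions = [
        ("Chr01", 7_500, 9_500),    # 2 kb, overlaps geneA
        ("Chr01", 10_000, 12_000),  # 2 kb, no gene overlap
        ("Chr02", 20_500, 21_000),  # 0.5 kb, filtered out by length
    ]
    print(classify_deletions(toy_deletions, toy_genes))
```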

Results

superTranscriptome Construction

Collectively, 223 RNA-seq accessions of seven diverse sweet sorghum genotypes from the public repository formed 45,049 genome-guided transcripts with SSRG and 886,115 de novo transcripts, constituting 45,864 gene clusters. The representative sequence of each cluster was prepared using Lace (Davidson et al. 2017). The superTranscriptome reported 45,864 genes with three types of sequences: genome-guided sequences, which were already annotated on SSRG; unannotated or novel transcribed genomic regions, which require further functional annotation; and de novo spliced isoforms or novel genes, which showed less than 90% similarity with SSRG genome-guided contigs but were annotated on grain sorghum cultivar genomes such as BTX623, BTX642, and RTX430. For the third sequence type, Open Reading Frames (ORFs) were identified in de novo transcripts using protein sequences retrieved from Phytozome [https://phytozome-next.jgi.doe.gov/] (Fig. 1b). Of the 886,115 de novo transcripts, only 163,651 were reported with ORFs. Of these 163,651 transcripts, 161,712 were successfully merged with known protein-coding superTranscripts from SSRG to retrieve complete protein-coding orthologs. The remaining 1,939 de novo transcripts, which did not show significant similarity with any genome-guided transcript, formed 815 separate clusters. This suggests that some genes in sweet sorghum derive exclusively from a grain sorghum background. The number of superTranscripts may vary with the clustering parameters, which can lead to the loss of some annotated genes from the reference genome. However, no significant change in contiguity was observed when the clustering parameters were changed (Supplemental Table S2).

superTranscriptome Quality

Contiguity analysis of the superTranscriptome with TransRate reported an N50 value of 3,551 bp. We found that 30,697 (66.93%) and 572 (1.24%) superTranscripts were above 1 kb and 10 kb in length, respectively. Only 184 (0.4%) superTranscripts were below 200 bp in length (Table 1; Supplemental Data S2), and the remaining 14,411 (31.42%) sequences ranged from 200–800 bp. The results suggest that most sequences were contiguous and could be complete gene orthologs, and they highlight the diversity of the superTranscript length distribution (Supplemental Figure S2a). BUSCO assembly completeness analysis suggested that 94.19% of superTranscripts shared complete orthologs with Poales, with a 6% duplication rate. Duplication could therefore be the reason for reporting more orthologs for some essential genes (Fig. 2). The superTranscriptome also reported incomplete/partial genes in comparison with the Rio, SC187, BTX642, RTX430, and BTX623 sorghum genomes, suggesting that specific genes in the population lack start codons, stop codons, or both, and may play an essential role in trait development. To check the quality of the assembly, 48 RNA-seq accessions of sweet sorghum Rio were aligned to the superTranscriptome and to six published sorghum genomes using HISAT2 (Kim et al. 2019). SSRG showed the highest average read alignment (84.06%), followed by the grain sorghum genomes BTX623 (82.29%), BTX642 (82.07%), SC187 (82.06%), and RTX430 (81.87%), and then the superTranscriptome (79.18%). These results suggest that average read coverage was about 5% lower for the superTranscriptome than for SSRG (Supplemental Figure S2b).

Table 1 Assembly statistics from TransRate for the sweet sorghum superTranscriptome, showing the number of sequences, the total number of bases, the contiguity of the sequences, mean contig length, the smallest and largest sequences in the assembly, etc.

Fig. 2

BUSCO assembly completeness analysis with the Poaceae database using the superTranscriptome and six published sorghum reference genomes, showing complete (single copy/duplicated), fragmented, and missing genes

Functional Annotation and Characterization of Unannotated Genes

The sweet sorghum superTranscriptome comprises 45,864 genes. Among these, 31,234 were annotated with SSRG, 815 with grain sorghum, and 13,815 remained unannotated. Within this unannotated group, the Trinotate pipeline identified and annotated 2,802 new protein-coding genes/superTranscripts with SwissProt. Among these newly annotated protein-coding genes, 559 encode various transcription factors (TFs), including NAC, MYB, and chromo-domain proteins, as determined by their DNA-binding domains using the PlantTFcat (Dai et al. 2013), PlantTFDB (Guo et al. 2008), and iTAK (Zheng et al. 2016) online server databases (Supplemental Data S3, sheet2). Of the remaining 11,013 sequences, 572 genes/superTranscripts were annotated with diverse roles in intron splicing, gene silencing, and ribosomal assembly when searched for non-coding RNA annotation against the RNAcentral database. The remaining 10,441 superTranscripts were checked for coding potential with CPC2, which reported 954 protein-coding and 9,487 non-coding regions. The 9,487 putative non-coding genes/superTranscripts were potential sources of lncRNAs, of which 6,516 were annotated with PLncDB (Jin et al. 2021) using BLASTN (Fig. 3; Supplemental Figure S3, Supplemental Data S3). The results suggest that about 21.93% of the sweet sorghum superTranscripts are non-coding RNAs involved in diverse functions, including gene silencing, intron splicing, and gene regulation. Long non-coding RNAs accounted for 14% of total expressed sequences; however, this proportion may increase when an independent study is performed.
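The counts in this annotation cascade can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
total = 45_864
annotated_ssrg, annotated_grain, unannotated = 31_234, 815, 13_815
assert annotated_ssrg + annotated_grain + unannotated == total

swissprot_coding = 2_802
after_swissprot = unannotated - swissprot_coding  # 11,013 left for the RNAcentral search
rnacentral_ncrna = 572
after_rnacentral = after_swissprot - rnacentral_ncrna  # 10,441 left for CPC2
cpc2_coding, cpc2_noncoding = 954, 9_487
assert cpc2_coding + cpc2_noncoding == after_rnacentral

# Non-coding share: RNAcentral hits plus CPC2 non-coding calls, over all superTranscripts.
noncoding_percent = 100 * (rnacentral_ncrna + cpc2_noncoding) / total
# lncRNA share: PLncDB-annotated superTranscripts over all superTranscripts.
plncdb_lncrna = 6_516
lncrna_percent = 100 * plncdb_lncrna / total

print(f"non-coding: {noncoding_percent:.2f}%")  # 21.93%
print(f"lncRNA: {lncrna_percent:.2f}%")         # 14.21%
```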

Fig. 3

Donut plot showing functional annotations of superTranscriptome with proportion of annotated sequences with different databases

superTranscript-Based Genome Annotation and Gene Densities

Of a total of 45,864 genes/superTranscripts, 41,169 (89.76%) were annotated on 15 diverse sorghum genomes using the PASA gene structure annotation tool (Haas et al. 2008), and 4,695 (10.24%) remained singletons. These genes were selectively transcribed across sorghum genotypes and contributed to variability. Compared with the other genomes, SSRG recorded the highest number of annotated genes, 40,901 (89.17%). This was much higher than in the previous study on sweet sorghum genome annotation (Cooper et al. 2019). Additionally, chromosome-wise gene counts and gene density in SSRG were higher when annotated with the superTranscriptome using PASA (Supplemental Figure S4a, b; Supplemental Data S4). The superTranscripts showed several gene structure updates on 14 diverse sorghum genomes (excluding wild progenitor 353). The updates include new gene additions, single gene model updates, and alt-splice site additions. Additionally, the superTranscript-based approach reported extensive alternative splicing on these 15 cultivar genomes. The results demonstrate that the superTranscriptome improved genome annotations, gene structures, and alternative splicing (Fig. 4; Tables 2, 3).

Fig. 4

Bar diagram showing various types of gene structure updates on 15 published sorghum genomes when annotated with superTranscriptome using PASA pipeline

Table 2 PASA number of annotated features on 15 diverse sorghum genomes using superTranscriptome
Table 3 PASA alternative splicing updates on 15 diverse sorghum genomes annotated using the superTranscriptome. The reported spliced introns and exons suggest that the superTranscript-based approach to genome annotation provides a better understanding of gene structures

Orthogroups Identification, Synteny Analysis, and Identification of De Novo Genes

A total of 454,772 protein-coding gene models across 15 genome assemblies and 4,695 singleton genes were used for orthogroup identification. Of the total 459,467 genes, 448,293 (97.57%) were assigned to orthogroups and the remaining 11,174 (2.43%) were unassigned. The present analysis reported a total of 44,222 orthogroups in sweet sorghum (Supplemental Data S5 and Supplemental Data S5(a)). Similar results for orthogroup identification were obtained previously for the sorghum pan-genome analysis (Tao et al. 2021). Our OrthoFinder analysis reported several orthogroups that showed duplications on the 15 cultivar genomes. The three genomes with the highest numbers of duplicated orthogroups were RTX430 (3,001), BTX642 (1,471), and Rio (792) (Supplemental Figure S5a). This suggests that gene duplication is frequent in sorghum and that the cultivars RTX430, BTX642, and Rio were selected with duplicated genes during domestication. Of 540,286 superTranscript-based gene models across 15 genome assemblies, 516,825 (95.66%) were syntenic genes, forming 20,419 syntenic gene families. The results demonstrate that more than 90% of genes across the 15 sorghum genome assemblies were collinear, with few syntenic re-arrangements. In total, 1,422,576 syntenic gene pairs were located on the same chromosome or scaffold across the 15 genome assemblies, whereas 7,885 were reported with syntenic re-arrangements. We compared the synteny of sweet sorghum (SSRG) with the wild progenitor (353) and grain sorghum (BTX623). This reported 108,856 syntenic gene pairs on the same chromosome/scaffold and 1,528 syntenic re-arrangements between these genomes. We took subsets of these syntenic genes and visualized them as ideograms (Supplemental Figure S5b; Supplemental Data S5(b)). The synteny analysis also reported several tandem duplicates in each assembly. Rio reported the highest number of tandem duplicates (2,154), followed by RTX430 (2,101), Wray (613), PI510757 (342), PI329311 (332), etc. (Supplemental Table S3). Most of the tandemly duplicated genes in Rio were related to Gene Ontologies (GOs) such as protein binding, DNA binding, protein kinase activity, nucleic acid binding, catalytic activity, nucleotide binding, DNA-binding transcription factor activity, and metabolic processes (Supplemental Data S5, Sheet 3 and 4). This suggests that duplication among these genes was frequent in sweet sorghum and could have contributed significantly to its evolution.

To report de novo genes, 37,003 genes on SSRG were used for synteny comparison. We found that 34,288 were syntenic with grain sorghum, wild sorghum, other C3-C4 grasses, and Arabidopsis (Fig. 5). The remaining 2,715 were considered de novo genes because they lacked synteny with the above organisms. These 2,715 genes were evaluated for de novo ORFs and for transcriptional evidence using RNA-seq. This reported 1,669 genes with complete/partial ORFs and 1,543 with transcriptional evidence in the leaf, meristem, and internode tissues of Rio. The minimum and maximum ORF lengths for de novo genes were 91 and 8,346 amino acids (aa), respectively. Of the 1,669 de novo ORFs, only 895 were transcribed in Rio and 774 were non-transcribed. A total of 1,046 de novo genes lacked ORFs, of which 648 were transcribed in Rio. Because these genes lack a start codon, further mutations may be needed to trigger their translational activity. The remaining 398 de novo genes were non-transcribed and lacked ORFs, and could be pseudogenes (Supplemental Table S4). Most of the de novo genes encoded elements such as TNT-1, RE-1, RE-2, DUF domain, MYB/SANT-like domain, reverse transcriptase, zinc finger domain, core and linker histones, ribonuclease H, etc. (Supplemental Data S5(c)). A total of 173 de novo genes were found to retain internal sequences from MULE, TNT, and RE transposases when the internal sequences of these transposases were queried using BLASTN (e-value < 10−37) (Supplemental Data S5(c), sheet 3). This highlights the role of these transposases in de novo gene formation.
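The de novo gene counts form a two-by-two table (ORF status by transcription status), and its margins can be checked against the reported totals. The sketch uses only the figures given above; the dictionary layout is illustrative.

```python
# Cells of the ORF-by-transcription table for de novo genes, as reported in the text.
cells = {
    ("ORF", "transcribed"): 895,
    ("ORF", "not transcribed"): 774,
    ("no ORF", "transcribed"): 648,
    ("no ORF", "not transcribed"): 398,
}


def margin(cells, position, value):
    """Sum all cells whose key has `value` at index `position` (0 = ORF status, 1 = transcription)."""
    return sum(count for key, count in cells.items() if key[position] == value)


with_orf = margin(cells, 0, "ORF")             # 1,669 genes with complete/partial ORFs
without_orf = margin(cells, 0, "no ORF")       # 1,046 genes lacking ORFs
transcribed = margin(cells, 1, "transcribed")  # 1,543 genes with transcriptional evidence
de_novo_total = sum(cells.values())            # 2,715 de novo genes

assert de_novo_total == 37_003 - 34_288  # SSRG genes minus syntenic genes
print(with_orf, without_orf, transcribed, de_novo_total)  # 1669 1046 1543 2715
```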

Fig. 5

Synteny-based phylogenetic grouping of organisms using UPGMA clustering

Identification of Core, Dispensable, and Private Genes in the Sweet Sorghum

Pan-genome analyses have classified gene families into core, dispensable, and cloud categories in many plants, including sorghum (Ruperao et al. 2021; Tao et al. 2021; Wang et al. 2021), rice (Sun et al. 2017; Zhao et al. 2018; Qin et al. 2021), maize (Hirsch et al. 2014), and pea (Yang et al. 2022). Here, the superTranscripts were used to annotate 15 diverse sorghum genomes and, based on genomic presence/absence, were classified into core (20,743), dispensable (18,915), and cloud (1,511) genes (Fig. 6a, b). The remaining 4,695 singleton superTranscripts were marked as orphan genes because they either lacked homologs on the 15 genomes or had only partial homologs. Collectively, 24,625 (53.69%) superTranscripts showed presence/absence variation across the 15 sorghum genomes (excluding single-genome superTranscripts and core genes), suggesting that these genes contributed to variability in sorghum (Supplemental Data S6, Sheet 2). GO enrichment analysis indicated that these genes encoded various Transposable Elements (TEs), proteolytic enzymes, and regulators of biological processes, which suggests that transposition, proteolysis, and gene regulation were significant contributors to the observed variability in sorghum (Supplemental Figure S6a). Additionally, some superTranscripts were located exclusively on specific cultivar genomes (Fig. 6c), suggesting that sweet sorghum carries some genes from cellulosic, grain, and forage backgrounds; however, the functions of these genes in sweet sorghum are not known, and the genes might carry modifications. Phylogenetic analysis based on gene presence/absence across the 15 cultivar genomes was performed using UPGMA clustering, as previously described for genome-wide SNPs and PAVs (Tao et al. 2021). This grouped the 15 genotypes according to their geographical origin.
We found that US-origin (Rio and Wray) and Asia-Europe-origin (Chinese amber and Leoti) sweet sorghums formed separate groups, suggesting that the gene set carried by sweet sorghum varies with geographical origin (Fig. 6d). Further, 301 superTranscripts were exclusive to sweet sorghum genomes (Rio, Wray, Leoti, and Chinese amber). These include genes encoding hexokinases, core and linker histones (H2A, H3, H4, and H1), cytochromes, MULE-like transposase, glucosyl transferases, chitinases, cell wall-associated receptor kinases, and MYB-TF, as well as some known and novel lncRNAs. GO analysis showed that most of these genes were associated with transmembrane transport of molecules such as electrons, metal ions, and organic and inorganic ions. They were also linked to catalytic activity, cell signaling, gene regulation, chromosome assembly, and DNA packaging. These results indicate that sweet sorghum carries additional alleles for maintaining cellular homeostasis, managing energy and carbohydrates, and controlling chromatin and DNA packaging (Supplemental Figure S6b; Supplemental Data S6, Sheet 3). Because these genes were exclusive to sweet sorghum genomes, they probably evolved later. The orphan genes contributed 9.64% of the superTranscriptome, and KEGG Orthology (KO) analysis was performed on them using KAAS (Moriya et al. 2007). The majority of orphan genes encoded TFs, signaling molecules, transporters, and catalytic enzymes that play significant roles in cellular physiology, cell wall development, and abiotic and biotic stress resistance, and that probably contribute to sweet sorghum-specific traits (Supplemental Figures S6c, d).
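
The presence/absence classification described above can be illustrated with a minimal sketch. The cutoffs are assumptions made for illustration only: this section does not state the thresholds that separate cloud from dispensable genes, so `cloud_max` is hypothetical, as are the gene identifiers and the toy set of five genomes. The sketch also assumes that a gene counts as exclusive to a group when it occurs in at least one member of the group and in no genome outside it.

```python
def classify_pav(presence, n_genomes, cloud_max=2):
    """Classify genes by the number of genomes that carry them.

    presence  : dict mapping gene id -> set of genome names carrying the gene
    n_genomes : total number of genomes compared
    cloud_max : hypothetical cutoff; genes found in 2..cloud_max genomes
                are called 'cloud', the rest of the variable genes 'dispensable'
    """
    classes = {}
    for gene, genomes in presence.items():
        k = len(genomes)
        if k == 0:
            classes[gene] = "absent"
        elif k == n_genomes:
            classes[gene] = "core"
        elif k == 1:
            classes[gene] = "singleton"
        elif k <= cloud_max:
            classes[gene] = "cloud"
        else:
            classes[gene] = "dispensable"
    return classes


def exclusive_to(presence, group):
    """Genes found in at least one genome of `group` and in none outside it."""
    group = set(group)
    return sorted(g for g, genomes in presence.items()
                  if genomes and genomes <= group)


# Toy presence/absence data; gene ids g1..g5 are placeholders.
GENOMES = ["Rio", "Wray", "Leoti", "BTX623", "353"]
presence = {
    "g1": {"Rio", "Wray", "Leoti", "BTX623", "353"},
    "g2": {"Rio", "Wray"},
    "g3": {"Rio"},
    "g4": {"Rio", "Leoti", "BTX623", "353"},
    "g5": {"BTX623", "353"},
}

classes = classify_pav(presence, len(GENOMES))
sweet_only = exclusive_to(presence, ["Rio", "Wray", "Leoti"])
print(classes)
print(sweet_only)
```

With real data, the same presence sets could also be converted to binary vectors and clustered to obtain a PAV-based grouping of genotypes.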

Fig. 6

a Pie chart showing the number of core, dispensable, and cloud genes identified on 15 published sorghum genomes. b Bar diagram showing the proportion of core, dispensable, and cloud genes in each sorghum genome when annotated with the superTranscriptome using PASA. c Flower plot showing the distribution of core, shell, and private genes across 15 diverse sorghum genomes; it highlights sweet sorghum-specific genes and genes from various cultivar backgrounds in sweet sorghum. d Unweighted Pair-Group Method with Arithmetic mean (UPGMA)-based phylogenetic grouping of 15 cultivars by gene presence/absence, showing that Asia-Europe-origin and US-origin sweet sorghums carry distinct sets of genes

Origin of Sweet Sorghum-Specific Genes

Of the 301 sweet sorghum-specific genes, only 25 had hits with Arabidopsis, suggesting that these sequences have a Mesangiospermae origin. A total of 185 sequences had hits with C3 grasses, of which 160 were added over Arabidopsis, suggesting that those sequences originated within the Poaceae family. Next, 238 sequences had hits with C4 grasses, of which 53 were added over Poales, indicating that those 53 sequences have an Andropogoneae origin. A total of 270 sequences had hits with wild sorghum (353) and 283 with grain sorghum (BTX623), adding 32 and 13 sequences at the wild and grain sorghum levels, respectively. This indicates that those sequences are sorghum-specific, that is, of Sorghinae origin. The remaining 18 did not display any hits, were specific to sweet sorghum, and encoded histone H1/H5, unknown proteins, and non-coding genes. This suggests that these 18 genes were selected in sweet sorghum (Fig. 7a, b; Supplemental Data S7a). Interestingly, of these 301 sweet sorghum-specific genes, only 58 were identified as de novo genes and assigned gene ages (Supplemental Figure S7; Supplemental Data S7b). The ages of all 301 genes were determined from the taxonomic age of the target organism or family using the TimeTree5 database (Supplemental Table S5) (Kumar et al. 2022). We found that 270 (89.70%) genes date to > 9 million years ago (MYA), while the remaining 31 (10.30%) date to < 3.55 MYA, indicating that they evolved within sorghum.
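
The stepwise age assignment above can be sketched as follows. The sketch assumes that hits are nested, so that a gene with a hit at a distant level also has hits at all closer levels, which is what the "added over" wording implies; the level labels and function names are chosen for this illustration.

```python
# Comparison levels ordered from the most distant to the closest relative.
LEVELS = [
    "Arabidopsis",
    "C3 grasses",
    "C4 grasses",
    "wild sorghum (353)",
    "grain sorghum (BTX623)",
]

# Cumulative numbers of the 301 genes with hits at each level (from the text).
cumulative_hits = {
    "Arabidopsis": 25,
    "C3 grasses": 185,
    "C4 grasses": 238,
    "wild sorghum (353)": 270,
    "grain sorghum (BTX623)": 283,
}
TOTAL = 301


def oldest_level(hits):
    """Return the most distant level at which a gene has a hit, else None."""
    for level in LEVELS:
        if level in hits:
            return level
    return None


def newly_added(cumulative, levels):
    """Genes first detected at each level, assuming nested hits."""
    added, previous = {}, 0
    for level in levels:
        added[level] = cumulative[level] - previous
        previous = cumulative[level]
    return added


added = newly_added(cumulative_hits, LEVELS)
no_hit = TOTAL - cumulative_hits[LEVELS[-1]]
older = cumulative_hits["wild sorghum (353)"]
younger = TOTAL - older

print(added)
print(no_hit, older, younger)
```

Under this assumption the per-level additions are 25, 160, 53, 32, and 13, leaving 18 genes without any hit; the 31 youngest genes are the 13 first detected in grain sorghum plus those 18.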

Fig. 7

a The 301 genes exclusive to sweet sorghum genomes were assigned gene ages based on their sequence similarity hits with cDNA sequences of several C3-C4 grasses and Arabidopsis. b Sequence similarity hits (0/1) of the 301 sweet sorghum-specific genes with several other C3-C4 grasses and Arabidopsis, used for phylogenetic grouping based on shared genes with UPGMA clustering

Differential Exon Usage (DEU) Analysis

The traditional genome-based approach with a standard blocking scheme reported only 148,847 exons on SSRG. In contrast, superTranscripts aligned with RNA-seq reads under a dynamic blocking scheme yielded 461,355 exon bins, suggesting that exon splicing was extensive in superTranscripts. For example, the standard blocking scheme used for DEU analysis with SSRG yielded five exons for the NLP2 TF coding gene on chromosome 6 (SbRio.06G148100), of which three were differentially used for spliced transcript formation across six developmental stages. For the same gene, the superTranscriptome with dynamic blocking identified 19 exon bins, of which eight were differentially used for transcript formation (Supplemental Figures S8a, b; Supplemental Data S8). This suggests that superTranscript-based dynamic blocking was more informative for this gene than SSRG-based standard blocking. It also increased the probability of finding alternatively spliced transcripts, which may be involved in adaptation, stress responses, and trait development. DEU testing with the two references, SSRG and superTranscriptome, reported 20,942 and 22,554 genes with differential exon usage, respectively. With SSRG, 12,793 (61.08%) genes were true positives (q-value < 0.05) and 4,583 (21.88%) were true negatives (q-value > 0.9); with the superTranscriptome, the corresponding numbers were 13,556 (60.10%) and 5,749 (25.48%). When datasets were trained using the logistic regression method, the superTranscriptome-based approach gave a better classifier for true labels than SSRG (Fig. 8a). Additionally, the confusion matrices prepared for the two approaches using the KNeighborsClassifier method showed more true labels for the superTranscriptome than for the SSRG-based standard approach (Fig. 8b, c).
These results indicate that the genes identified by the superTranscriptome represent real transcripts, as shown by the more accurate estimates of DEU they gave when tested with supervised machine learning and differential splicing. The DEU analysis of Rio and PR22 during internode growth reported that 8,787 and 5,769 genes were differentially spliced (q-value < 0.05) in Rio and PR22, respectively (Supplemental Figures S8c, d). This suggests that differential splicing of genes plays an important role in phenotype development.
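
The q-value labelling used above can be sketched in plain Python. The sketch does not reproduce the logistic regression or KNeighborsClassifier models from the analysis; it only shows how the thresholds q < 0.05 and q > 0.9 turn q-values into binary labels and how a confusion matrix is tallied against predicted labels. The q-values and predictions are made-up toy data, and the function names are chosen for this illustration.

```python
def label_from_qvalue(q, pos_cut=0.05, neg_cut=0.9):
    """Return 1 for q < pos_cut, 0 for q > neg_cut, None for the middle range."""
    if q < pos_cut:
        return 1
    if q > neg_cut:
        return 0
    return None


def confusion_matrix(true_labels, predicted_labels):
    """Tally binary labels as [[TN, FP], [FN, TP]]."""
    matrix = [[0, 0], [0, 0]]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[truth][predicted] += 1
    return matrix


# Toy q-values for six genes (illustrative only).
q_values = [0.001, 0.03, 0.95, 0.50, 0.99, 0.20]
labels = [label_from_qvalue(q) for q in q_values]

# Genes in the ambiguous middle range are left out of training and testing.
true_labels = [lab for lab in labels if lab is not None]

# Hypothetical classifier output for the four labelled genes.
predicted = [1, 0, 0, 0]

matrix = confusion_matrix(true_labels, predicted)
accuracy = (matrix[0][0] + matrix[1][1]) / len(true_labels)
print(labels, matrix, accuracy)
```

Excluding genes with intermediate q-values keeps only confidently labelled examples, which is why the true positive and true negative counts do not sum to the total number of tested genes.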

Fig. 8

a ROC curve for datasets trained with the logistic regression method, showing that the superTranscriptome with dynamic blocking gives better estimates of DEU than the SSRG-based standard blocking approach. b Confusion matrix for DEU with SSRG, showing true and predicted labels when datasets were trained using the KNeighborsClassifier method. c Confusion matrix for DEU with the superTranscriptome, which reported more true labels than the SSRG-based approach when datasets were trained using the KNeighborsClassifier method

Differential Gene Expression (DGE) Analysis

Gene-level abundance estimates and DGE analysis using SSRG and the superTranscriptome as references reported 20,940 and 22,036 genes, respectively, with significant (P value < 0.05) changes in expression across the leaf, meristem, and internode tissues over six time points (Fig. 9a, b; Table 4; Supplemental Figure S9a). Among the 50 most highly expressed genes in the leaf, meristem, and internode tissues, 83% were the same for the two references, and the remaining 17% showed slightly altered expression (Supplemental Figures S9b, c). Past studies reported minor differences in gene expression between de novo and reference-based DGE analysis (Davidson and Oshlack 2014), suggesting that the choice of reference has little influence on gene expression estimates. The ROC curve based on datasets trained using logistic regression showed that the superTranscriptome gave better DGE estimates than SSRG (Supplemental Figure S9d). This suggests that the superTranscript-based approach identifies more truly expressed genes than SSRG and supports its suitability for DGE analysis. The top 50 highly expressed genes during internode growth differed between Rio and PR22. Rio showed several Light Harvesting Complexes (LHCs) but no metallothionein-II within its top 50 genes. Conversely, PR22 showed fewer LHCs but expressed metallothionein-II. This suggests that the internode phenotype is likely influenced by active photosynthesis and metal ion transport (Fig. 9c). Further, the growing meristem of Rio expressed metallothionein-II, indicating that this gene appears to be essential during active growth of the meristem but is not necessarily required in the sugar-accumulating internode. GO terms related to the transport of metal ions, secondary metabolites, and organic compounds were more common among genes differentially expressed in PR22 during internode growth (Supplemental Figure S9e).
Therefore, active transport of metal ions, secondary metabolites, and organic compounds may contribute to dry or pithy stems in sorghum. Further, 9,028 genes showed differential (P value < 0.05) expression during internode growth in both genotypes, whereas 6,092 and 2,840 genes were differentially expressed exclusively in Rio and PR22, respectively (Supplemental Figure S9f). Interestingly, 487 and 348 lncRNAs were also differentially expressed during the internode growth of Rio and PR22, respectively. This suggests that lncRNAs could be hidden players that control internode development and confer sugary internodes (Supplemental Figures S9g, h). Further, the top 50 highly expressed lncRNAs and TFs were listed, and their expression patterns were correlated with percentage (%) Brix readings taken previously during internode growth for both genotypes (Cooper et al. 2019). In Rio, several lncRNAs, especially the novel lncRNA SBRIO_LNC019054, along with NAC-2-1 and NAC-60, showed elevated expression with sugar accumulation, suggesting that they may be positive regulators of sucrose-related genes. In contrast, during the internode growth of PR22, FHA-SMAD-21, Tify-13, and the conserved SBIC_LNC004344 showed higher expression with decreasing sucrose levels, implying that they may function as negative regulators of sucrose (Supplemental Figures S9i, j).
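
The correlation of expression patterns with Brix readings can be illustrated with a Pearson coefficient. The statistic actually used is not stated in this section, so Pearson correlation is an assumption here, as are the cutoff of 0.8, the function names, and all numbers in the toy series below.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def regulator_call(r, cutoff=0.8):
    """Turn a correlation into a tentative call; the cutoff is hypothetical."""
    if r >= cutoff:
        return "candidate positive regulator"
    if r <= -cutoff:
        return "candidate negative regulator"
    return "no call"


# Made-up values for six time points during internode growth.
brix = [5.0, 8.0, 11.0, 14.0, 16.0, 18.0]
expr_rising = [1.0, 2.0, 3.5, 5.0, 6.0, 7.5]
expr_falling = [9.0, 7.0, 6.0, 4.0, 3.0, 1.5]

r_up = pearson(brix, expr_rising)
r_down = pearson(brix, expr_falling)
print(regulator_call(r_up), regulator_call(r_down))
```

A call of this kind only flags co-variation with sugar content; it does not establish a regulatory relationship.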

Fig. 9

a Volcano plot showing differentially expressed genes (P value < 0.05) across leaf, internode, and meristem tissues when SSRG was used as the reference for DGE analysis. b Volcano plot showing more differentially expressed genes (P value < 0.05) across leaf, internode, and meristem tissues when the superTranscriptome was used as the reference for DGE analysis. c Top 50 highly expressed genes in Rio and PR22 during internode growth, showing that Rio expressed comparatively more LHCs and lacked expression of metallothionein-II

Table 4 Number of genes with significant (P value < 0.05) changes in expression across leaf, meristem, and internode tissues of sweet sorghum Rio, from Differential Gene Expression (DGE) analysis performed using DESeq2 with two different references

Identification of Agronomically Essential Genes and Phylogenetic Analysis

Agronomically important genes such as histones, MULE-like transposases, zinc finger domain genes, hexokinases, cytochrome b5, SWEET, SUT, invertase, expansin, USP, NAC, and MYB transcription factors were more numerous in sweet sorghum than in grain sorghum, and some were exclusive to sweet sorghum genomes. The majority of novel histone genes either formed separate clades or formed clades with Hordeum sp. (Supplemental Figures S10a–d). Further, these novel histone genes were located on the lower arm of chromosome 8 on SSRG, suggesting that this region of chromosome 8 carries duplications of histone genes in sweet sorghum (Supplemental Figure S10e). The hexokinase, cytochrome b5, and MYB transcription factor genes from sweet sorghum were related to those of sorghum and Saccharum sp., suggesting that these genes may be ancestral copies of genes from sorghum and sugarcane (Supplemental Figures S10f–h). MULE-like transposases were abundant in the sweet sorghum genome; of these, SwSb-MULE-21** was exclusive to sweet sorghum genomes and similar to existing MULE genes in sweet sorghum (Supplemental Figure S10i). This suggests that MULE genes might have duplicated in sweet sorghum genomes. The zinc finger domain proteins form a major transcription factor family associated with abiotic stress responses in plants (Han et al. 2020). Five zinc finger loci, namely SwSb-zf-2**, SwSb-zf-3**, SwSb-zf-4**, SwSb-zf-5**, and SwSb-zf-6**, were exclusive to sweet sorghum and showed similarity with existing sweet and grain sorghum genes (Supplemental Figure S10j). This indicates that some zinc finger genes were also duplicated in sweet sorghum. SUT4, one of the six SUT genes, was also identified in sweet sorghum. This gene was previously reported to be absent from SSRG due to large deletions (Cooper et al. 2019).
The present analysis reported the presence of a SUT4 ortholog, named SwSb-SUT-6*, through de novo transcriptome assembly (Supplemental Figure S10k). The Sugars Will Eventually be Exported Transporters (SWEET) are a gene family involved in phloem loading and unloading. The superTranscriptomic approach reported 24 SWEET genes, including one additional SWEET locus, SwSb-SWEET-24*, located on chromosome 8 with only one exon (Supplemental Figure S10l). The plant-specific NAC transcription factor qualitatively controls dry biomass, sugar production, and grain yield (Xia et al. 2018). The superTranscriptome identified 130 NAC-TF encoding loci, of which six were newly reported on SSRG. The new NAC loci were more similar to those of sorghum, Saccharum sp., and maize than to those of other outgroup species (Supplemental Figure S10m). Further, the remaining newly annotated genes, which include invertase, expansin, and USP, formed separate clades in the phylogeny, suggesting that these sequences carry some evolutionary modifications in sweet sorghum (Supplemental Figures S10n–p).

The numbers and chromosomal locations of agronomically important genes on the SSRG and grain sorghum (BTX623) reference genomes showed that these loci were more abundant in sweet sorghum than in grain sorghum (Fig. 10a, b). This suggests that sweet sorghum carries additional loci for agronomically important genes. Combining de novo transcriptome assembly with genome-guided assembly recovered more, and more complete, orthologs of agronomically important genes. Nevertheless, we were unable to validate the functions of the newly discovered genes in sweet sorghum; further GWAS analysis may provide more information about them.

Fig. 10

a Genomic locations of agronomically important QTLs such as histone, cytochrome, hexokinase, MULE-like transposase, and MYB-TF on the grain sorghum genome. b Genomic locations of agronomically important QTLs such as novel histones, cytochrome, hexokinase, MULE-like transposase, and MYB-TF on the sweet sorghum genome

Genomic Landscape of Sweet Sorghum Evolution

Compared to sweet sorghum (SSRG), ScanPAV reported 14,293 (31.6 MB) and 20,869 (47.99 MB) large deletions (> 1 kb) in grain (BTX623) and wild (353) sorghum, respectively. These include the loss of 2,974 genes in grain sorghum and 5,028 genes in wild sorghum, indicating that wild sorghum carries more deletions than grain sorghum in both genic and intergenic regions. Interestingly, 58 and 51 sweet sorghum-specific (private) genes showed deletions in wild and grain sorghum, respectively (Table 5ii). These results confirm that sweet sorghum-specific genes are deleted in grain and wild sorghum, suggesting that insertions of these genes were selected in sweet sorghum. Compared to wild sorghum, sequence loss in cultivated sorghum accounted for 20,495 (47.94 MB) and 20,501 (48.82 MB) large deletions in grain and sweet sorghum, respectively (Table 5i). This included deletions of 2,394 and 2,177 genes in grain and sweet sorghum, respectively (Table 5iii). Thus, more gene deletions were found in grain sorghum (2,394) than in sweet sorghum (2,177), suggesting that more genic deletions were selected in grain sorghum. In contrast, sweet sorghum showed fewer genic deletions (2,177) but more intergenic deletions (18,324) than grain sorghum (18,101), suggesting that more intergenic deletions were selected in sweet sorghum. Together, the results suggest that both deletions and insertions were frequent in cultivated sorghum and that selection of genes with SVs led to subtype development.
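
The split into genic and intergenic deletions can be sketched as an interval-overlap test. This is not the ScanPAV procedure itself; the coordinates, chromosome names, and function names below are hypothetical, and intervals are assumed to be half-open. The last lines only re-derive the intergenic counts from the totals reported above, treating each deleted gene as one genic deletion, as the text does.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def classify_deletions(deletions, genes, min_len=1000):
    """Split deletions longer than `min_len` bp into genic and intergenic.

    deletions, genes : lists of (chromosome, start, end) tuples
    """
    genic, intergenic = [], []
    for chrom, start, end in deletions:
        if end - start <= min_len:
            continue  # keep only large deletions (> 1 kb by default)
        hit = any(c == chrom and overlaps(start, end, g_start, g_end)
                  for c, g_start, g_end in genes)
        (genic if hit else intergenic).append((chrom, start, end))
    return genic, intergenic


# Hypothetical gene models and deletion calls.
genes = [("chr6", 10_000, 12_500), ("chr1", 50_000, 53_000)]
deletions = [
    ("chr6", 11_000, 13_100),   # overlaps the chr6 gene
    ("chr6", 20_000, 24_000),   # no gene in this interval
    ("chr1", 52_900, 53_400),   # shorter than 1 kb, skipped
    ("chr2", 10_500, 12_000),   # no genes on this chromosome
]

genic, intergenic = classify_deletions(deletions, genes)
print(len(genic), len(intergenic))

# Intergenic counts implied by the reported totals (versus wild sorghum).
intergenic_sweet = 20_501 - 2_177
intergenic_grain = 20_495 - 2_394
print(intergenic_sweet, intergenic_grain)
```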

Table 5 ScanPAV results show variable regions (MB) and number of genes reported with presence/absence over three genome assemblies i.e. SSRG, Sorghum Reference Genome (BTX623), and wild progenitor (353)

The Dry locus or D locus (Sobic.006G147400) encodes a plant-specific NAC transcription factor that qualitatively controls midrib color and stem composition in sorghum (Xia et al. 2018; Zhang et al. 2018). The present analysis reported large deletions in the D locus of sweet sorghum compared with grain and wild sorghum: 2.05 kb and 2.29 kb relative to wild and grain sorghum, respectively (Supplemental Figures S11a–c). In comparison with wild sorghum, a total of 3 and 5 NAC encoding loci (including the D locus) were deleted in grain and sweet sorghum, respectively (Table 5iii; Supplemental Data S11). In sweet sorghum, these include four NAC encoding loci other than the D locus, located on chromosomes 1, 2, 4, and 6 (Supplemental Figures S11d–h). These four NAC encoding loci were affected by terminal deletions and could be non-functional due to the loss of upstream/downstream promoter regions. The results show that sweet sorghum NAC encoding loci were affected by deletions compared with the wild progenitor (353), and suggest that deletions in NAC encoding genes could be one of the reasons for trait development in sweet sorghum.

Discussion

The superTranscriptome identified a total of 45,864 genes, which was higher than previously reported in sorghum (McCormick et al. 2018), sweet sorghum (Cooper et al. 2019), the sorghum pan-genome (Ruperao et al. 2021; Tao et al. 2021; Wang et al. 2021), the rice pan-genome (Zhao et al. 2018; Qin et al. 2021), and barley (Jayakodi et al. 2020), and much lower than in the maize pan-genome (Hufford et al. 2021). These results indicate that the superTranscriptome approach identified more expressed genes and suggest that a single reference genome cannot capture all genes expressed in sweet sorghum. Since superTranscripts are probabilistic gene models for sweet sorghum, their functions may vary from genome to genome depending on the gene structure. The N50 value for the superTranscriptome was 3,551 bp, much higher than the N50 values previously reported for pan-transcriptome assemblies of maize (Hirsch et al. 2014; Jin et al. 2016), alfalfa (Medina et al. 2021), potato (Petek et al. 2020), and barley (Ma et al. 2019), showing the increased contiguity of the superTranscriptome assembly. BUSCO assembly completeness studies reported several gene duplications and missing/partial genes that could promote trait development in sorghum. Past studies on the sorghum pan-genome reported BUSCO gene completeness ranging from 94.4 to 98% (Tao et al. 2021), supporting our findings. Partial genes have been associated with sugarcane aphid (SCA) resistance in the cultivated sorghum variety TX278 (Wang et al. 2021). Furthermore, a comparative analysis of maize genomes reported that incomplete and missing genes contribute to maize adaptation under diverse environments and to crop improvement (Yang et al. 2017; Li et al. 2019a). The comparative analysis of sweet and grain sorghum genomes reported 2,175 genome rearrangements, including several deletions and duplications, along with 98,723 base substitutions (missense:silent ratio = 1:1) in coding regions (Cooper et al. 2019). This suggests that the sweet sorghum genome accumulated several evolutionary changes during its domestication. The superTranscriptome showed on average 5% less read coverage than SSRG, indicating that an average of 5% of transcribed sequences was lost during superTranscript construction.

The lncRNAs contribute 14% of the superTranscriptome. Past studies suggest that lncRNAs are highly tissue-specific and condition-specific (Statello et al. 2021). In the present analysis, we reported some lncRNAs with significant changes in expression during internode development. Between 8,000 and 23,309 lncRNAs have been found in maize (Wang et al. 2015; Lv et al. 2019), sorghum (Sun et al. 2020), rice (Zhou et al. 2021), and barley (Unver and Tombuloglu 2020), indicating that lncRNAs make up a substantial share of plant transcriptomes. Recent studies on ten high-quality genome assemblies of diverse bioenergy sorghum genotypes reported similar results for lncRNAs (Voelker et al. 2023). The present analysis reported 44,222 orthogroups in sweet sorghum, similar to the number previously reported for the sorghum pan-genome (Tao et al. 2021). Genes in a population are most likely affected by SVs and duplication events during speciation, which result in the loss or formation of orthogroups (Brasó-Vives et al. 2022; Prabh and Rödelsperger 2022; Berdan et al. 2024). Our synteny analysis revealed that Rio had the highest number of tandem duplicates (2,154), which indicates that duplications were frequent in Rio and could be associated with the activity of various transposases. We reported 2,715 de novo genes in sweet sorghum, of which 648 (23.68%) were transcribed without ORFs. The absence of start codons suggests that these genes are currently untranslated, and further mutations could change their functional status. Further, 398 (14.65%) de novo genes were non-transcribed and lacked ORFs, suggesting that they could be pseudogenes. We identified 173 de novo genes that likely originated from the activity of MULE, TNT, and RE transposases. The origins of the remaining de novo genes remain undetermined.
The dispensable genes, which encode various Transposable Elements (TEs), proteolytic enzymes, and regulators of biological processes, were major contributors to variability; their proportion was about 53.69% in the superTranscriptome. Previous studies, however, reported 63.6% dispensable genes in sorghum in a comparison with rice, soybean, and Brachypodium sp., suggesting that sorghum is more genetically diverse than other crops (Xin et al. 2021). Past studies on sorghum reported that TEs may play a role in gene content variation (Voelker et al. 2023), and this work substantiates that finding; however, processes such as proteolysis and gene regulation were not reported previously. Our findings showed that proteolysis and gene regulation also contribute to the observed sorghum variability. Phylogenetic analysis based on gene presence/absence placed US-origin and Asia-Europe-origin sweet sorghums in separate groups, highlighting geography-based selection of genes within sweet sorghum. Similar results were reported for gene PAV- and SNP-based phylogenetic analysis of ten sorghum genome assemblies (Voelker et al. 2023). A total of 301 genes were exclusive to sweet sorghum genomes, including hexokinases, core histones, and cytochromes. These 301 genes were assigned gene ages based on their sequence similarity hits with cDNA sequences of several C3–C4 outgroups and Arabidopsis, which showed that 18 were specific to sweet sorghum. The synteny-based comparison revealed that only 58 of these 301 were de novo genes for which gene ages were determined. These genes newly evolved in sweet sorghum. A recent study reported that some histones and unique chromatin remodeling factors newly evolved in sorghum (Hu et al. 2022).

Our DEU analysis reported more true positive genes with the superTranscriptome-based approach than with SSRG when datasets were trained with the KNeighborsClassifier method, and better estimates of DEU when datasets were trained with the logistic regression method. The superTranscriptome also reported more differentially spliced genes, supporting the view that its gene models are true transcripts. Additionally, the count-based method we used for gene-level abundance estimates and statistical inference is considered superior to traditional TPM-based methods (Soneson et al. 2016). DEU analysis of Rio and PR22 during internode development suggested that differential splicing is extensive in Rio and could be one of the reasons for trait development in Rio.

Using two different references, the superTranscriptome and SSRG, we reported more differentially expressed genes (P value < 0.05) with the superTranscriptome than with SSRG. The superTranscriptome also gave better estimates of DGE than SSRG when datasets were trained with the logistic regression method. This suggests that the superTranscriptome reduces the bias caused by using a single reference genome for analyses such as DEU and DGE (Davidson et al. 2017; Davidson and Oshlack 2018; Wang et al. 2023). DGE analysis of Rio and PR22 during internode development reported more LHCs and no metallothionein-II expression among the top 50 genes in Rio, suggesting that higher expression of LHCs and lower expression of metallothioneins contribute to sugary internodes. Metallothioneins transport metal ions such as copper (Cu2+), and a lack of them causes metal ions to build up in different parts of plants (R. Benatti et al. 2014). Past studies on Chenopodium murale (Llerena et al. 2021), Colobanthus quitensis (Contreras et al. 2018), and sugarcane (Agarwala et al. 1993) reported that copper stimulates sugar levels in various plant parts, suggesting that the presence of metal ions inside the cell positively regulates sugar accumulation in internode tissues.

SUT, SWEET, MYB, and NAC are important gene families controlling internode development in sorghum (Mizuno et al. 2016; Zhang et al. 2018; Hennet et al. 2020). Past studies reported six SUT genes with elevated expression in sweet sorghum stem internodes compared with grain sorghum (Li et al. 2014; Babst et al. 2021). SUT4 was previously reported with putative deletions in Rio (Cooper et al. 2019); our de novo transcriptome assembly with 223 RNA-seq accessions detected it, suggesting that SUT4 is present in the sweet sorghum population. Several SUT genes have been reported in maize (4–7) (Leach et al. 2017), rice (5) (Aoki et al. 2003; Hirose et al. 2010), wheat (4) (Deol et al. 2013), and barley (5) (Radchuk et al. 2017), with a variety of functions in growth and development. To date, 23 SWEET genes have been reported in sweet sorghum (Mizuno et al. 2016); SSRG reported only 21 SWEET genes, with two others, SWEET3-3 and SWEET8-2, carrying putative deletions, whereas we reported 24 orthologs using the superTranscriptome-based approach. Interestingly, maize and foxtail millet also have 24 SWEET orthologs (Liu et al. 2022), suggesting that there could be a history of introgression of SWEET genes between sorghum, maize, and foxtail millet. Genome introgression between wild relatives and crops within the same family has also been reported in other crops (Ellstrand et al. 1999; Hufford et al. 2013; Ananda et al. 2020). In sorghum, plant-specific NAC-TFs control a variety of traits, including high cellular biomass (Xia et al. 2018), insect-pest resistance (Zhang et al. 2013), and drought/salinity tolerance (Sanjari et al. 2019; Punia et al. 2021). In this study, we reported six new loci for NAC genes. One of them, NAC-73-2*, was involved in secondary cell wall development and showed remarkable PAV across the 15 genotypes, suggesting that this gene contributes to variability in sweet sorghum.
The number of NAC genes identified by the superTranscriptome approach was lower than previously reported in the grain sorghum genome (Kadier et al. 2017), and the loss of NAC genes might be a reason for the development of the sweet phenotype. Novel histone genes were located on the lower arm of chromosome 8 on SSRG, suggesting that this region carries duplications of histone genes. Comparative genomics of sorghum and rice reported that 68.6% of genes were collinear between these two crops, with several duplicated regions on chromosomes 5 and 8 in sorghum (Paterson et al. 2009), which supports our findings. Most of our sweet sorghum-specific loci, which encoded genes such as hexokinases, cytochrome b5, MYB, MULE-like transposases, and zinc finger domain proteins, were more numerous in sweet sorghum than in grain sorghum and formed clades with sorghum and sugarcane in the phylogeny, suggesting that these genes might be ancestral copies. Past studies reported that most sorghum duplications were derived from ancestral gene families and that a new transposon family was a major contributor to recent duplication in sorghum (Guo et al. 2019). Interestingly, we found that MULE-like transposase genes were abundant in sweet sorghum, implying that this transposase family may contribute to ancestral gene duplications in sweet sorghum. However, the functions of those duplicated genes in sweet sorghum remain unknown. MULE-mediated duplication of host genes usually results in pseudogenes rather than new functional protein-coding genes (Juretic et al. 2005).

Our comparative analysis of one wild and two cultivated sorghum genomes using ScanPAV showed that both deletions and insertions were frequent during sorghum evolution. The dry gene, or D locus, on chromosome 6 regulates internode growth and midrib color qualitatively, and natural mutants for this gene produce sugary internodes (Zhang et al. 2018). We found major deletions (~2 kb) in the D locus in sweet sorghum (Rio) in comparison with grain sorghum (BTX623) and the wild type (353), suggesting that a partially deleted D locus is a major determinant of the internode phenotype. These findings are consistent with an earlier report that juicy grain-type sorghum (BTX623) carries ~1.8 kb deletions in the D locus when compared with the dry grain type (Ji2731) (Zhang et al. 2018). Beyond the previously known D locus, we also identified four additional NAC loci in sweet sorghum that have been affected by terminal deletions compared with the wild progenitor (353), suggesting that deletions in NAC loci might play a significant role in the evolution of sweet sorghum. Previous research on the genomes of sweet and grain sorghum indicated that deletions and duplications can occur in a single gene or in groups of related genes at the same time (Cooper et al. 2019). Such deletions or duplications in specific genes or clusters of related genes could therefore contribute to the differences observed between wild and cultivated sorghum.
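
The presence/absence comparisons discussed above reduce, at their simplest, to tabulating per-locus calls across genomes and asking which loci are present everywhere and which are missing from some genotypes. The following is a minimal sketch of that tabulation only; it is not the ScanPAV procedure used in this study. The function name `summarize_pav`, the input layout, and the locus and genotype labels are hypothetical placeholders, not data from this work.

```python
from typing import Dict


def summarize_pav(calls: Dict[str, Dict[str, bool]]) -> Dict[str, dict]:
    """Classify each locus as 'core', 'variable', or 'absent' across genotypes.

    `calls` maps locus -> {genotype: True if the locus was detected}.
    This layout is an assumption made for illustration.
    """
    summary = {}
    for locus, by_genotype in calls.items():
        present = sorted(g for g, found in by_genotype.items() if found)
        absent = sorted(g for g, found in by_genotype.items() if not found)
        if not present:
            status = "absent"      # detected in no genotype
        elif not absent:
            status = "core"        # detected in every genotype
        else:
            status = "variable"    # presence/absence variation
        summary[locus] = {
            "status": status,
            "present_in": present,
            "absent_in": absent,
        }
    return summary


# Hypothetical toy input; labels and calls are placeholders, not study results.
toy_calls = {
    "locus_A": {"genotype_1": True, "genotype_2": True, "genotype_3": True},
    "locus_B": {"genotype_1": True, "genotype_2": False, "genotype_3": True},
}

for locus, info in summarize_pav(toy_calls).items():
    print(locus, info["status"], "absent in:", info["absent_in"])
```

In practice the calls would come from an alignment-based tool, and partial events such as the terminal deletions described above would need a finer representation than a single present/absent flag per locus.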

Conclusion

The superTranscriptome identified 44,222 orthogroups in sweet sorghum consisting of 45,864 genes, with 301 genes unique to sweet sorghum. These genes are involved in functions such as chromatin organization, gene regulation, sugar metabolism, and cell wall synthesis. A gene age analysis suggested that 18 of these genes might be newly evolved in sweet sorghum. superTranscript-based Differential Exon Usage (DEU) and Differential Gene Expression (DGE) analyses provided more accurate estimates than traditional methods and show potential as replacements for them. The DGE analysis of Rio and PR22 showed that the lists of the 50 most highly expressed genes differed between these two genotypes during internode growth. In Rio, several Light Harvesting Complexes (LHCs) were prominent, while metallothionein-II was absent from the top 50. This suggests that active photosynthesis and metal ion transport play key roles in sugary internode development in Rio. Additionally, expression patterns of various long non-coding RNAs (lncRNAs) and transcription factors (TFs) were linked to sugar accumulation. The superTranscriptome also identified more orthologs for agronomically important genes and several tandem duplications across 15 genome assemblies. Rio had the highest number of tandem duplicates (2,154) across all genomes, possibly due to transposase activity. Significant deletions (>2 kb) in the D locus and terminal deletions in four other NAC-encoding loci were observed in sweet sorghum compared with the wild type (353), which could be one of the contributors to sweet sorghum evolution.