Introduction

The chloroplast is the plant organelle responsible for photosynthesis and related metabolic reactions, and it contains its own genome (Chakraborty et al. 2020). The chloroplast genome is much smaller than the nuclear genome (Wilson et al. 2011) and is characterized by moderate molecular weight, ease of sequencing, multiple copies, simple structure (Zhang et al. 2023), a moderate nucleotide substitution rate, marked differences in molecular evolutionary rate between coding and non-coding regions, and good collinearity among the chloroplast genomes of different groups (Parmar et al. 2022). It is maternally inherited and independent of the nuclear genome (Sadhu et al. 2023), and it has been widely used in plant species identification, genetic diversity analysis, and phylogenetic analysis (Finkeldey and Gailing 2013). As phylogenetics and genomics have been integrated in plant studies, the advantages of the chloroplast genome have gradually emerged: an increasing number of plant chloroplast genomes have been sequenced and assembled (Wanichthanarak et al. 2023), and a large body of genetic resources has been analyzed with different bioinformatics methods, providing a new approach to resolving taxonomically difficult groups (Simmonds et al. 2021). In nature, 61 codons encode 20 amino acids. Because of the degeneracy of the genetic code, most amino acids are encoded by multiple codons, with the exception of methionine (Met) and tryptophan (Trp), each of which is encoded by a single codon; codons that encode the same amino acid are called synonymous codons (Li et al. 2016). In the absence of mutational pressure and natural selection, synonymous codons should be used at equal frequencies. In practice, however, the synonymous codons of a given amino acid are often used unequally, a phenomenon known as codon usage bias (CUB) (Dehlinger et al. 2021).
The currently accepted hypothesis is that CUB results from the combined action of three evolutionary forces: mutation, selection, and genetic drift (Morton 2022). CUB has been detected in all species studied to date, and it differs fundamentally among phylogenetically distant species (Somaratne et al. 2019). Characterizing CUB in different species and selecting appropriate codons for transgenic studies can therefore improve the expression efficiency of exogenous genes and lay a foundation for subsequent studies on the molecular mechanisms of plant adaptation to environmental change (Almutairi 2021). Numerous studies have confirmed that CUB analysis is applicable at the level of the plant chloroplast genome. The study of protein-coding sequence characteristics and codon usage patterns in chloroplast genomes will thus provide basic information for elucidating phylogenetic relationships among plant species and for chloroplast genetic engineering.

Annonaceae is a primitive angiosperm taxon and a prevalent family in tropical floras, renowned for its economic and medicinal importance. It is the largest family of Magnoliales, with approximately 129 genera and over 2300 species (Jiang and Li 1979). Except for Asimina Adans., which occurs in temperate regions, all genera are found in tropical and subtropical regions of the world, where they are an important component of lowland tropical rainforest ecosystems. Annonaceae plants are also widely distributed in China, from the Southwest to Taiwan; most occur in South China and a few in East China. In China, 24 genera, 103 species, and 6 varieties have been recorded (Li and Michael 2012). Most fruits of this family are edible, and the main cultivated species are Annona squamosa L., Annona muricata L., and Annona reticulata L. (Nogueira et al. 2022). Wild species include Fissistigma uonicum in Fujian Province, Fissistigma oldhamii in the Ganzhou region of Jiangxi Province, and Uvaria microcarpa and Uvaria grandiflora in Guangdong and Guangxi, whose fruits are mostly picked and eaten by local people or plant enthusiasts (Lei et al. 2022). Although Annonaceae resources are abundant overall, studies of the family's adaptive evolution and phylogeny remain limited; as a result, the species composition, population size and distribution, nutritional composition, and market demand of its wild fruits are not clearly understood, and these resources have not been effectively exploited.

In recent years, related studies have outlined the overall phylogenetic relationships of the Annonaceae, but many are based on multiple plastid DNA markers or DNA barcodes (Larranaga et al. 2019, 2022). These provide a good framework for studying patterns of interrelationship and diversity within the family, but their resolution is low compared with chloroplast genome data (Senapati et al. 2023). The set of available Annonaceae chloroplast genomes has grown with the use of high-throughput sequencing technology and sequence analysis methods. A total of 14 Annonaceae chloroplast genomes have been published at the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov), accounting for 0.58% of the family (more than 2,400 species), and the molecular evolution of Annonaceae chloroplasts at the codon level has not been reported. In this study, the chloroplast genomes of these 14 species were analyzed to reveal differences in their sequence characteristics and the factors influencing CUB. The results are intended to provide a theoretical basis and reference for the development and utilization of Annonaceae resources and for studies of phylogenetic relationships among related species, gene sequence evolution, adaptive evolution, and classification within the family, and to support further research and application.

Materials and methods

Comparative genomic analysis

Chloroplast genome data for 14 species were collected from the NCBI database: Polyalthiopsis verrucipes, Fissistigma oldhamii, Fissistigma polyanthum, Uvaria macrophylla, Greenwayodendron suaveolens, Chieniodendron hainanense, Annona muricata, Annona reticulata, Cananga odorata, Monoon laui, Miliusa glochidioides, Artabotrys pilosus, Artabotrys hexapetalus, and Polyalthia suberosa. Basic genomic features, such as the total sequence length, the IR region, and the lengths of the four basic regions, were first examined (Wong et al. 2023). Expansion and contraction of the IR boundaries were analyzed and mapped using IRSCOPE (http://irscope.shinyapps.io/irapp) (Uckele et al. 2021). Large sequence repeats (LSR) of four possible types (forward, reverse, complement, and palindromic repeats) were analyzed with the online software REPuter (http://bibiserv.techfak.uni-bielefeld.de/reputer/) using the following parameter settings: sequence identity greater than 90%, Hamming distance of 3, and length greater than 30 bp (Rawal et al. 2021). Tandem repeats were identified using Tandem Repeats Finder (TRF). Simple sequence repeats (SSRs), which consist of tandemly repeated units 1–6 bp in length, can be used as molecular markers and have been widely used in population genetic analyses (Mehmood et al. 2020). The software MISA (https://webblast.ipk-gatersleben.de/misa/) was used to identify SSRs in the chloroplast genomes. The minimum numbers of repeats for mono- to hexanucleotide motifs were set to 10, 5, 4, 3, 3, and 3, respectively. All sequences were finally corrected manually, and redundant repeats were removed by hand (Tyagi et al. 2020).
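The SSR thresholds above can be illustrated with a short sketch. This is not MISA and does not reproduce its output options; it is a minimal scanner for perfect repeats that applies the same minimum repeat counts (10, 5, 4, 3, 3, 3). The names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical, and the rule that a motif must not itself be a repetition of a shorter unit is an assumption made here to avoid reporting a poly-A run as a dinucleotide repeat.

```python
# Minimum repeat counts for motif lengths 1-6 bp, as listed in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq):
    """Return (start, motif, repeat_count) for each perfect SSR (0-based)."""
    seq = seq.upper()
    hits, i = [], 0
    while i < len(seq):
        for k, min_rep in MIN_REPEATS.items():
            motif = seq[i:i + k]
            if len(motif) < k or set(motif) - set("ACGT"):
                continue  # too close to the end, or ambiguous bases
            if not is_primitive(motif):
                continue  # e.g. "AA" is reported as mononucleotide "A"
            count = 1
            while seq[i + count * k:i + (count + 1) * k] == motif:
                count += 1
            if count >= min_rep:
                hits.append((i, motif, count))
                i += count * k - 1  # move to the last base of this SSR
                break
        i += 1
    return hits
```

For example, `find_ssrs("GG" + "A" * 10 + "CC" + "AT" * 5 + "G")` reports one mononucleotide and one dinucleotide SSR, whereas a run of only nine A's is not reported.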

Analysis of codon usage bias of chloroplast genes

Genomic data collection and sequences selection

After the whole chloroplast genome sequences of the 14 species were obtained, the coding sequences (CDSs) were extracted according to the GenBank annotation. To improve the accuracy of the codon bias analysis, the sequences were screened with BioEdit (Abdullah et al. 2020). First, short CDSs (less than 300 bp) and duplicated gene sequences were removed. Second, the length of each retained CDS had to be a multiple of three. Finally, each CDS had to begin with a correct start codon (ATG) and end with a stop codon (TAA, TAG, or TGA) (Waswa et al. 2023).
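The screening rules above can be expressed compactly in code. The following is a minimal sketch, not the BioEdit workflow used in the study; the function names and the choice to treat "duplicated" as an exact sequence match are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_cds_filters(cds, min_length=300):
    """Apply the length, reading-frame, start-codon and stop-codon rules."""
    cds = cds.upper()
    return (
        len(cds) >= min_length        # drop CDSs shorter than 300 bp
        and len(cds) % 3 == 0         # length must be a multiple of three
        and cds.startswith("ATG")     # correct start codon
        and cds[-3:] in STOP_CODONS   # ends with TAA, TAG or TGA
    )


def filter_cds(records):
    """records maps gene name -> sequence; returns the retained subset.

    Exact duplicate sequences are kept only once (assumption: the first
    occurrence is the one retained).
    """
    kept, seen = {}, set()
    for name, seq in records.items():
        seq = seq.upper()
        if passes_cds_filters(seq) and seq not in seen:
            kept[name] = seq
            seen.add(seq)
    return kept
```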

Calculation of codon related parameters and codon composition analysis

In this study, CodonW 1.4.2 was used to analyze the CUB of the 686 selected CDSs of the 14 species. The parameters included relative synonymous codon usage (RSCU), relative frequency of synonymous codons (RFSC), high-frequency (HF) codons, effective number of codons (ENC), codon adaptation index (CAI), codon bias index (CBI), frequency of optimal codons (FOP), total number of amino acids (L_aa), and GC content at the third position of synonymous codons (GC3s).

RSCU, RFSC, HF, ENC, CAI, CBI, and FOP can be used to evaluate genomic CUB (Sahoo and Rakshit 2022). RSCU was used to detect variation in the usage pattern of all synonymous codons; it is the ratio of the observed frequency of a codon to the frequency expected if all synonymous codons for that amino acid were used equally (Cao et al. 2023). An RSCU value greater than 1 indicates that the codon is used more often than expected, a value greater than 2 indicates that it is used extremely often, and a value less than 1 indicates that it is used less often than expected. RFSC is the ratio of the observed number of a codon to the total number of its synonymous codons, and it reflects the usage frequency of each synonymous codon. A codon was classed as HF if its RFSC was greater than 60% or greater than 0.5 times the average of its synonymous codons (Chen et al. 2023). Sequences with ENC values below 30 and above 55 correspond to highly and lowly expressed genes, respectively, and lower ENC values indicate stronger CUB (Morton 2021). CAI measures how closely the usage of synonymous codons in a coding region matches that of the optimal codons; it ranges from 0 to 1, with larger values indicating greater adaptation and higher levels of gene expression. A CBI of 1 indicates that only optimal codons are used, whereas a CBI below 0 indicates that no optimal codons are involved. FOP is the ratio of the number of optimal codons to the number of their synonymous codons and ranges from 0 to 1: a value of 1 indicates that only optimal codons are used, and a value of 0 indicates that none are used (Zhang et al. 2018).
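The RSCU and RFSC definitions above can be made concrete with a small sketch. It is an illustration only and is not CodonW; the function names are hypothetical. It assumes the standard codon assignments (which the plastid code shares for sense codons) and pools codon counts over all supplied CDSs.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(cds_list):
    """Count in-frame codons over a list of CDS strings."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    return counts


def rscu_and_rfsc(counts):
    """Return (rscu, rfsc) dictionaries for all sense codons.

    RFSC = codon count / total count of its synonymous family.
    RSCU = RFSC * family size (observed / expected under equal usage).
    """
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    rscu, rfsc = {}, {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent from the data
        for c in codons:
            rfsc[c] = counts.get(c, 0) / total
            rscu[c] = rfsc[c] * len(codons)
    return rscu, rfsc
```

With a toy CDS `ATGAAAAAAAAGTAA`, lysine is encoded twice by AAA and once by AAG, so AAA has an RFSC of 2/3 and an RSCU of 4/3, while AAG has an RSCU of 2/3.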

Optimal codons analysis

Optimal codons were identified following a published procedure with slight modifications (Chen et al. 2023). First, the eligible chloroplast genes of each of the 14 species were sorted by ENC value from high to low, and about 10% of the genes from each end were taken to construct a low-expression gene pool (large ENC values) and a high-expression gene pool (small ENC values). The difference in RSCU between the two pools (ΔRSCU = RSCU of the high-expression pool − RSCU of the low-expression pool) was calculated with CodonW 1.4.2 (Uckele et al. 2021). A codon with a ΔRSCU of at least 0.08 was considered a high-expression codon. Second, a codon with an RSCU value greater than 1 was considered an HF codon. Finally, codons that satisfied both conditions were recognized as optimal codons.
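The selection procedure above can be sketched as follows. This is a hedged illustration rather than the CodonW computation: the function names are hypothetical, the RSCU dictionaries are assumed to have been computed beforehand for each pool and for the whole gene set, and the reading that the "RSCU greater than 1" condition refers to the whole-gene-set RSCU is an assumption.

```python
def split_expression_pools(enc_by_gene, fraction=0.10):
    """Return (high_expression, low_expression) gene name lists.

    Genes are ranked by ENC from high to low; the top fraction (largest
    ENC) forms the low-expression pool and the bottom fraction (smallest
    ENC) forms the high-expression pool.
    """
    ranked = sorted(enc_by_gene, key=enc_by_gene.get, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    return ranked[-n:], ranked[:n]


def optimal_codons(rscu_high, rscu_low, rscu_all, delta_min=0.08):
    """Codons with delta-RSCU >= delta_min and overall RSCU > 1."""
    optimal = []
    for codon, overall in rscu_all.items():
        delta = rscu_high.get(codon, 0.0) - rscu_low.get(codon, 0.0)
        if delta >= delta_min and overall > 1:
            optimal.append(codon)
    return sorted(optimal)
```

In practice the two pools returned by `split_expression_pools` would each be passed through an RSCU calculation before calling `optimal_codons`.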

Comparative analysis of codon usage frequency

Codon usage frequency is the number of occurrences of a particular codon per thousand codons of the encoding genes (Zhao et al. 2023). Codon usage frequency data for Arabidopsis thaliana (http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=3702), Nicotiana tabacum (http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=4097), Escherichia coli (http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=199310), and Saccharomyces cerevisiae (http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=4932) were downloaded from the Codon Usage Database (http://www.kazusa.or.jp/codon). For each codon, the ratio of its usage frequency in an Annonaceae species to that in a candidate host was calculated. A ratio less than or equal to 0.5, or greater than or equal to 2, indicates a pronounced difference in CUB between the two organisms, such that the host would be unsuitable as a receptor for heterologous gene expression (Cao et al. 2023). Conversely, a ratio between 0.5 and 2 indicates that the CUB of the two organisms is closely similar, and the host can be used as a receptor for heterologous expression.
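The frequency comparison can be illustrated with a brief sketch. The function names are hypothetical, and the handling of codons that are absent from the host table (treated here as divergent, since the ratio is undefined) is an assumption rather than part of the published method.

```python
def per_thousand(counts):
    """Convert raw codon counts to frequencies per thousand codons."""
    total = sum(counts.values())
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


def divergent_codons(freq_query, freq_host, low=0.5, high=2.0):
    """Codons whose query/host frequency ratio is <= low or >= high."""
    divergent = []
    for codon, fq in freq_query.items():
        fh = freq_host.get(codon, 0.0)
        if fh == 0:
            divergent.append(codon)  # assumption: undefined ratio counts as divergent
            continue
        ratio = fq / fh
        if ratio <= low or ratio >= high:
            divergent.append(codon)
    return sorted(divergent)
```

The number of codons returned for each host gives a simple count of divergent codons on which candidate hosts can be compared.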

ENC-plot analysis

For the ENC-plot, scatter plots were drawn with GC3 on the horizontal axis and ENC on the vertical axis. The factors affecting codon preference were judged from the positions of the genes in the plot (Wang et al. 2022). If genes were distributed on or near the standard curve, codon preference was mainly affected by mutation; if genes were located well below the standard curve, codon preference was more strongly affected by natural selection and other factors (Yengkhom et al. 2019).
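The standard curve is not written out in the text. A commonly used form, following Wright's expected ENC under GC3s-driven mutational bias alone, is ENC = 2 + s + 29 / (s² + (1 − s)²), where s is GC3s. The sketch below assumes that form; the ENC ratio, (expected − observed) / expected, is likewise a common convention and is assumed here to be the quantity tabulated as the "ENC ratio". Function names are hypothetical.

```python
def expected_enc(gc3s):
    """Expected ENC on the standard curve for a given GC3s (0 to 1)."""
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def enc_ratio(enc_observed, gc3s):
    """Relative distance of a gene below (+) or above (-) the curve."""
    enc_exp = expected_enc(gc3s)
    return (enc_exp - enc_observed) / enc_exp
```

A gene with a ratio near zero lies on or near the curve; larger positive ratios correspond to genes lying farther below it.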

Parity rule 2 plot analysis

In this study, the effects of mutational pressure and natural selection on nucleotide composition in the two DNA strands of the 14 species were examined by parity rule 2 plot (PR2-plot) analysis, following the method of Wu (2023). G3/(G3 + C3) was taken as the horizontal coordinate and A3/(A3 + T3) as the vertical coordinate. Whether base mutation affects nucleotide variation was assessed from the proportions of the A, T, G, and C bases. The point at which A3 equals T3 and G3 equals C3 is the center of the plot, which indicates no bias between the two complementary DNA strands due to mutational pressure or natural selection. If the values of G3 and C3, or of A3 and T3, are close to each other, the codon preference of the chloroplast genome is affected only by mutational pressure; if they differ markedly, codon preference is mainly caused by natural selection.
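The PR2 coordinates of a single gene can be computed as in the sketch below. It is a simplified illustration with a hypothetical function name: all third codon positions are counted, including that of the stop codon, which is an assumption, since PR2 analyses are often restricted to particular codon families.

```python
from collections import Counter


def pr2_point(cds):
    """Return (G3/(G3+C3), A3/(A3+T3)) for one CDS, or None if undefined."""
    third = Counter(cds.upper()[2::3])  # bases at codon position 3
    gc = third["G"] + third["C"]
    at = third["A"] + third["T"]
    if gc == 0 or at == 0:
        return None  # a coordinate would require division by zero
    return third["G"] / gc, third["A"] / at
```

A gene at (0.5, 0.5) sits at the center of the plot; a point with x > 0.5 and y < 0.5 lies in the lower right region, where G exceeds C and T exceeds A at the third position.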

Neutrality plot analysis

Neutrality plot analysis is a method for quantitatively analyzing the effects of mutational pressure and natural selection on CUB (Tyagi et al. 2020). Mutations at the third codon position are usually synonymous, whereas mutations at the first and second positions are usually non-synonymous and occur at a lower rate. In the neutrality plot, the average of GC1 and GC2 of each gene (GC12) was used as the vertical coordinate and GC3s as the horizontal coordinate, with each point representing one gene. If all points lie along the diagonal, there is no significant difference in base composition among the three codon positions, and CUB is affected only by mutational pressure. If the points do not lie along the diagonal and the correlation between GC12 and GC3s is very low, natural selection is the main factor influencing CUB (Guo et al. 2020).
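The regression underlying the neutrality plot can be sketched in plain Python. The function names are hypothetical, and for simplicity the sketch uses the overall GC content at the third position rather than GC3s (which excludes Met, Trp, and stop codons); that substitution is an assumption of this example.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3) as fractions for one in-frame CDS."""
    cds = cds.upper()
    return tuple(
        sum(base in "GC" for base in cds[p::3]) / len(cds[p::3])
        for p in range(3)
    )


def neutrality_regression(gc3_values, gc12_values):
    """Least-squares fit GC12 = slope * GC3 + intercept; returns R^2 too."""
    n = len(gc3_values)
    mx = sum(gc3_values) / n
    my = sum(gc12_values) / n
    sxx = sum((x - mx) ** 2 for x in gc3_values)
    syy = sum((y - my) ** 2 for y in gc12_values)
    sxy = sum((x - mx) * (y - my) for x, y in zip(gc3_values, gc12_values))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared
```

A slope close to 1 corresponds to points along the diagonal, and a slope close to 0 with a weak correlation corresponds to the selection-dominated case described above.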

Correspondence analysis

Correspondence analysis (COA) was used to characterize potential influences on CUB. To illustrate the CUB of the 14 species, the analysis was based on the 59 RSCU values of each gene (ATG, TGG, and the 3 termination codons were excluded). Scatter plots were drawn with axis 1 and axis 2 as the horizontal and vertical coordinates, the codon usage pattern was determined from the distribution of points, and correlation analyses were performed among axis 1, axis 2, CAI, ENC, GC3s, and L_aa (Geng et al. 2022).

Statistical analysis

CodonW 1.4.2 (https://codonw.sourceforge.net/) was used to analyze the codon characteristic parameters. The CUSP online program in EMBOSS (https://www.bioinformatics.nl/cgi-bin/emboss/cusp) was used to calculate, for each gene, the total GC content (GCall), the GC content at codon positions 1, 2, and 3 (GC1, GC2, and GC3), and the T, C, A, and G content at the third codon position (T3, C3, A3, and G3). SPSS 26.0 and Excel 2021 were used to analyze correlations among the codon usage bias indicators of the chloroplast genomes. Figures were prepared with R and Origin. Microsoft Office Word was used to edit the manuscript.

Results

Comparative genomic analysis

The basic characteristic of chloroplast sequences and codon composition of 14 species

The chloroplast genomes of the 14 species in this study all had the typical quadripartite structure, containing two single-copy regions (LSC/SSC) and two inverted repeat regions (IRa/IRb). As shown in Table 1, the chloroplast genome of A. reticulata was the longest (201,906 bp) and that of G. suaveolens the shortest (159,031 bp), with an average length of 175,293 bp across the 14 species. The length of the LSC region ranged from 69,650 to 126,172 bp, the SSC region from 3014 to 37,606 bp, and the IR region from 14,018 to 129,242 bp. The total number of genes ranged from 260 to 392; the highest number of CDSs was found in A. reticulata (115) and the lowest in G. suaveolens and M. laui (81). The number of tRNAs differed by only 4 among species, and the number of rRNAs was identical.

Table 1 The basic characteristic of chloroplast sequences of 14 species

The IR/SC sector boundary analysis of 14 species

The results of the IR/SC boundary analysis of the 14 species are shown in Fig. 1. The IR/SC boundaries differed significantly among the 14 species but were similar within the same genus or among closely related species. A large expansion of the IR region was observed in A. reticulata, A. muricata, and U. macrophylla, bringing a large number of genes into the IR region; the IR regions of these species all exceeded 100,000 bp in length, which may explain why their total genome length was greater than that of the other species. The LSC/IRb boundary lay between rps19 and rpl2 in P. verrucipes, M. laui, M. glochidioides, P. suberosa, and A. hexapetalus, and between trnI and ycf2 in F. oldhamii and F. polyanthum; it differed in all other species. The IRb/SSC boundary of P. verrucipes, M. laui, and P. suberosa lay within ycf1 and extended beyond 15 bp, and in most species it was close to the ndhF gene in the SSC region. The SSC/IRa boundary lay within the ycf1 gene in P. verrucipes, M. laui, M. glochidioides, and P. suberosa, within the ndhF gene in F. oldhamii, and between rpl32 and a tRNA gene in A. reticulata and A. muricata. In addition, the IRa/LSC boundary of most species was adjacent to the trnH gene in the LSC region. The LSC/IRb and IRa/LSC boundaries of F. oldhamii, F. polyanthum, and G. suaveolens had much in common.

Fig. 1

LSC, SSC and IR boundaries of the chloroplast genomes of 14 species

Repeat sequence analysis of the chloroplast genomes of 14 species

As shown in Fig. 2A, 79, 84, 80, 69, 63, 60, 83, 78, 101, 68, 67, 73, 72, and 73 SSR loci were detected in the 14 species, respectively. The most abundant repeat type was mono-nucleotides (30–69), followed by di-nucleotides (7–13), tetra-nucleotides (5–15), tri-nucleotides (3–11), penta-nucleotides (0–4), and hexa-nucleotides (1–9). The A/T motif accounted for the largest share of all repeat types (30–68), followed by AT/AT (5–12), whereas the other motifs were very rare (Fig. 2B). Repeats of the four types in the 14 species were identified by REPuter (Fig. 2C). The number of repeats per species ranged from 58 (M. glochidioides) to 300 (F. oldhamii, F. polyanthum, U. macrophylla, A. muricata, and A. reticulata); forward repeats (31–180) and palindromic repeats (24–146) were the most abundant types, followed by reverse repeats (0–29), and complement repeats were rare, occurring only in C. odorata. A total of 947 tandem repeats were found in the 14 species, ranging from 35 (P. verrucipes) to 160 (U. macrophylla) (Fig. 2D).

Fig. 2

Repeat sequence analysis of the chloroplast genomes of 14 species. A: The number of SSR in 14 species; B: The number of SSR motifs in different repeat class types; C: The number of four repeat types; D: The number of dispersed repeat sequences; The number (A ∼ N) of the 14 species are shown in Table 1

Analysis of codon usage bias of chloroplast genomes

The basic characteristic of chloroplast sequences and codon composition of 14 species

As shown in Table 2, across the chloroplast genomes of the 14 species, ENC ranged from 48.600 to 51.041 and CAI from 0.167 to 0.170, the mean CBI was −0.008, the mean FOP was 0.363, and L_aa ranged from 15,177 to 22,220. The low CAI values and relatively high ENC values indicate that CUB in these chloroplast genomes is weak. In all 14 species, GC1, GC2, GC3, and the average GC content were below 50%, indicating that chloroplast gene codons preferentially end in A/T. Moreover, GC1 was higher than GC2 and GC3, and GC3 was the lowest; the average GC content was similar among the 14 species (39.28–40.70%).

Table 2 The codon related parameters of 14 species

Analysis of RSCU and RFSC

The RSCU values of the 686 CDSs of the 14 species are given in Table S1; 30, 30, 30, 30, 30, 33, 30, 30, 29, 31, 28, 30, 30, 33, 32 codons of the 14 species had RSCU values greater than 1, respectively. Among the codons with RSCU > 1, the proportion ending in A/T ranged from 83.3% to 96.7% across the 14 species, whereas codons with RSCU < 1 were more likely to end in C/G, indicating that codons in the chloroplast genomes of the 14 species preferentially end in A/T. Twenty-one codons with RSCU > 1 were common to the 14 species: 12 ending in T, 8 ending in A, and 1 ending in G. Calculation of RFSC values showed that there were 17, 15, 14, 17, 18, 14, 14, 14, 15, 17, 15, 15, 14, and 14 HF codons in the 14 species, respectively, of which 8 were shared by all 14 species (TGT, GAT, GAA, CAT, AAA, AAT, AGA, TAT).

Determination of optimal codons of 14 species

As shown in Table 3 and Table S2, a total of 244 optimal codons with RSCU > 1 and ΔRSCU ≥ 0.08 were identified in the 14 species, with 18, 16, 18, 18, 18, 15, 19, 19, 16, 17, 18, 18, 17, and 17 optimal codons in each species, respectively. All species shared 5 optimal codons (CGT, GTT, TGT, TTG, TTT). Of the 244 optimal codons, 76 ended in A, 144 in T, 22 in G, and 2 in C, indicating that the third base of the optimal codons was biased towards A and T.

Table 3 Optimal codon of chloroplast genome of 14 species

Codon usage frequency analysis

Comparison of the codon usage frequencies of the 14 Annonaceae species with those of four commonly used exogenous expression hosts showed that the 14 species differed most from E. coli and differed relatively little from the other 3 hosts (Table 4, Table S3). A. pilosus differed least from S. cerevisiae, whereas the remaining 13 species differed least from N. tabacum; A. hexapetalus and N. tabacum did not differ at all.

Table 4 The differences in codon usage frequency between 14 species and 4 hosts

ENC-plot analysis of 14 species

In the ENC-plot, the CUB of genes lying on the standard curve is considered unaffected by natural selection, whereas genes lying away from the standard curve are affected by natural selection, mutational pressure, or other factors (Shahzadi et al. 2020). In this study, a portion of the genes in each of the 14 species (38%, 41%, 43%, 45%, 46%, 39%, 58%, 51%, 40%, 37%, 36%, 34%, 36%, and 42%) was located on or near the standard curve (Fig. 3; Table 5), indicating that the observed ENC values of these genes were close to the theoretical values and suggesting that their codon preference is not affected by natural selection; this proportion exceeded 50% in both A. muricata (58%) and A. reticulata (51%). The remaining genes lay farther from the standard curve, indicating that their observed ENC values deviated from the theoretical values and that their codon preference may be more strongly affected by natural selection.

Table 5 Frequency distribution of ENC ratio

In addition, the GC3s values of the 14 species in Fig. 3 are distributed over a narrow range (0.2244–0.401, 0.2308–0.4314, 0.2244–0.4142, 0.2308–0.3892, 0.2199–0.3774, 0.2308–0.4127, 0.2436–0.4513, 0.2308–0.4145, 0.2372–0.4032, 0.2179–0.3865, 0.2244–0.4047, 0.2244–0.4047, 0.2372–0.4207, 0.2372–0.4129), which also suggests that CUB was influenced by natural selection.

Fig. 3

ENC-plot analysis of chloroplast genomes of 14 species. The number (A ∼ N) of the 14 species are shown in Table 1

PR2-plot analysis of 14 species

The PR2-plot was used to reveal the influence of mutation and natural selection on CUB by analyzing the usage of A/T and G/C at the third codon position (Abdullah et al. 2020). As can be seen in Fig. 4, the genes are unevenly distributed among the four regions of the plot. In the vertical direction, most genes are located below the midline; in the horizontal direction, more genes lie to the right of the midline than to the left, but the difference is smaller than in the vertical direction. M. laui had equal numbers of genes in the lower left and lower right regions (quadrants III and IV); in P. verrucipes and C. hainanense, genes were most abundant in the lower left region (quadrant III); and in the remaining 11 species, genes were most abundant in the lower right region (quadrant IV). This suggests that, at the third codon position in the 14 species, G was more frequent than C and T was more frequent than A, and further suggests that most protein-coding genes in Annonaceae were subject to selective pressure.

Fig. 4

PR2-plot of the chloroplast genomes of 14 species. The number (A ∼ N) of the 14 species are shown in Table 1

Neutrality plot analysis of 14 species

As can be seen in Fig. 5, the regression coefficients and correlation coefficients of A. reticulata and A. muricata (No. G, H) were large, and the contributions of mutational pressure to their codon preference reached 32.14% and 23.21%. Among the 14 species, the codon preferences of these two were the most affected by mutational pressure and the least affected by natural selection. In contrast, M. glochidioides (No. K) was the species whose codon preference was the least affected by mutational pressure and the most affected by natural selection. The correlation between GC12 and GC3 was low in all 14 species, ranging from 0.0007 to 0.1037, which suggests that GC12 and GC3 follow different evolutionary patterns. The correlation coefficient of GC12 and GC3 ranged from 0.0287 to 0.3214, which indicates that both mutational pressure and natural selection affected the codon preference of the 14 species, but that natural selection had the greater effect. This result agrees with those of the ENC-plot and PR2-plot analyses and is consistent with findings in edible legumes (Chakraborty et al. 2021).

Fig. 5

Neutrality plot of chloroplast genomes of 14 species. The number (A ∼ N) of the 14 species are shown in Table 1

Correspondence analysis of 14 species

The COA results for the 14 species are shown in Fig. 6. The first axis explained 17.27%, 17.81%, 18.85%, 17.82%, 16.64%, 17.17%, 18.99%, 18.77%, 18%, 16.05%, 17.44%, 18.11%, 18.09%, and 17.26% of the total variation in the 14 species, respectively. The variation explained by the second, third, and fourth axes was smaller than that explained by the first axis. Thus, the first axis contributed the most to codon bias, although it represented only part of the variation in CUB (Zhang et al. 2018). Moreover, the different degrees of variation explained by the four axes suggest that CUB is not influenced by a single factor but may be related to mutation, natural selection, gene length and function, and other factors. In addition, the percentage of genes in different GC content classes varied significantly among the 14 species; genes with 0.45 ≤ GC ≤ 0.60 accounted for 4.26%, 7.84%, 9.80%, 8.51%, 4.35%, 6.52%, 12.00%, 10.20%, 6.00%, 5.88%, 6.00%, 6.00%, 8.00%, and 6.25%, respectively.

Fig. 6

Correspondence analysis of chloroplast genomes of 14 species. The number (A ∼ N) of the 14 species are shown in Table 1

Furthermore, correlation analysis was performed between axis 1, axis 2, and the codon indices CAI, ENC, GC3s, and L_aa to explore the factors influencing CUB (Fig. 7). Axis 1 was correlated with CAI in all 14 species and was not correlated with ENC or GC3s, while axis 1 and L_aa were positively correlated only in G. suaveolens. For axis 2, CAI was negatively correlated only in F. oldhamii; ENC was negatively correlated in P. verrucipes and M. glochidioides and positively correlated in C. hainanense and P. suberosa; GC3s was positively correlated in M. glochidioides, P. suberosa, A. pilosus, and A. hexapetalus and not correlated in the other species; and L_aa was correlated with axis 2 to varying degrees in all species except F. oldhamii and G. suaveolens.

Fig. 7

Correlation analysis of axis 1, axis2 and codon utilization index of chloroplast genomes of 14 species. The number (A ∼ N) of the 14 species are shown in Table 1. *P < 0.05, **P < 0.01, ***P < 0.001

Discussion

The chloroplast genome is the second largest genome and is relatively independent of the nuclear genome (Rehman et al. 2021). Although the chloroplast genome is relatively stable, differences in structure, size, and gene number may exist among families and species, which may be related to contraction and expansion of the IR region and to gene loss during genome evolution. Such differences can provide useful information for taxonomic, genetic, and ecological studies of plants (Konhar et al. 2019). In our study, the genome sizes of the 14 species varied greatly, ranging from 159,031 to 201,906 bp, which is consistent with the previously reported range of full-length chloroplast genome sequences and indicates that Annonaceae chloroplast genes are highly conserved. The number of chloroplast genes ranged from 260 to 392, which may be related to the metabolic pathways and ecological adaptations of different plants. SSRs in chloroplast genomes are important molecular markers that are widely used in studies of plant genetic diversity, species identification, and conservation biology. The number of SSRs in the 14 species ranged from 63 to 101; as in most species, mono-nucleotide repeats were the most numerous, followed by di-nucleotides, tetra-nucleotides, and tri-nucleotides, while penta-nucleotides and hexa-nucleotides were the least numerous. Moreover, A/T-type repeats were far more numerous than G/C-type repeats, which is consistent with findings in other plants such as Oryza (Chakraborty et al. 2020), Theaceae (Wang et al. 2022), and Orchidaceae (Liu et al. 2023). In addition, contraction, expansion, and deletion of the inverted repeat region can cause differences among chloroplast genomes, and the analysis of IR/SC boundaries and sequence variation in Annonaceae showed that the IR/SC boundaries differed significantly among the 14 species but were similar within the same genus or among closely related species.
In general, the whole genome is still more conservative, and the existence of different can provide a molecular basis for species identification and phylogenetic analysis in Annonaceae.
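The SSR tallies discussed above rest on scanning a sequence for tandemly repeated motifs of 1 to 6 bp. The following is a minimal sketch of that idea, not the pipeline used in this study: the function names and the minimum repeat thresholds in `MIN_REPEATS` are hypothetical choices made for illustration, and a dedicated SSR tool may handle overlapping or compound repeats differently.

```python
import re
from collections import Counter

# Hypothetical minimum repeat counts per motif length (an assumption,
# not the thresholds used in the study).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """Return True if the motif is not a repetition of a shorter unit.

    For example 'AT' is primitive, whereas 'ATAT' is just 'AT' twice.
    """
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a list of (start, motif, repeat_count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # A motif of `size` bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. 'AA' reported as a di-nucleotide repeat.
            if is_primitive(motif):
                hits.append(
                    (match.start(), motif, len(match.group(0)) // size)
                )
    return hits


def count_by_motif_length(hits):
    """Tally SSRs as mono- (1), di- (2), ... hexa-nucleotide (6) repeats."""
    return Counter(len(motif) for _, motif, _ in hits)


def is_at_type(motif):
    """Return True if the motif consists only of A and T bases."""
    return set(motif) <= {"A", "T"}
```

Applying `count_by_motif_length(find_ssrs(genome))` to each genome would give per-class counts comparable in spirit to those reported here, and `is_at_type` separates A/T-type from G/C-containing motifs. Absolute numbers depend strongly on the chosen thresholds.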

The ENC values of the 14 species fell in the range 48.600–51.041, which suggests low gene expression levels and a relatively weak CUB. In this study, the codon GC content of the chloroplast genomes of all 14 species was below 50% and followed the order GC1 > GC2 > GC3, which supports the hypothesis that “codons in higher plants tend to use A/T endings” (Campbell and Gowri 1990). PR2-plot analyses of the 14 species indicated that CUB was affected not only by mutation but also by natural selection and other factors. Neutrality plot and ENC-plot analyses likewise indicated that CUB was shaped largely by natural selection and only slightly by mutational pressure during the evolution of Annonaceae. Although mutational pressure and natural selection can each lead to CUB independently, and the degree and diversity of their impact differ among species, the combination of these two mechanisms and their cumulative effect over a long period were primarily responsible for the formation of CUB. The correlation analyses of the 14 species demonstrated that the base composition at the 3rd codon position differed from that at the 1st and 2nd positions and was related to gene length and GC content. The base composition at the 3rd codon position also had a greater effect on CUB, which suggests that the factors influencing CUB differ among species. The study of optimal codons in organisms helps to maximize translation speed and accuracy (Chi et al. 2023). A total of 244 optimal codons were identified across the 14 species in this study, with 15 to 19 per species, which suggests that different Annonaceae species have adopted different strategies of optimal codon use during adaptation to natural selection. A total of 220 of these codons ended in A/T, which further confirms that the chloroplast genes of the 14 species prefer NNA and NNT codons. Hershberg et al. (2008) revealed that the number of optimal codons was relatively small under purifying selection and increased under high natural selection and mutation pressures. Therefore, it can be assumed that the chloroplast genomes of the 14 species were not under purifying selection.
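Two of the quantities this discussion relies on, the GC content at each codon position (GC1, GC2, GC3) and the relative synonymous codon usage (RSCU), can be computed directly from a coding sequence. The sketch below is a simplified illustration rather than the software used in this study. The function names are hypothetical, the amino acid assignments are those of the standard genetic code (which the plant plastid code shares), and the GC calculation counts every complete codon, whereas dedicated tools may exclude stop, Met, and Trp codons.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def split_codons(cds):
    """Split a coding sequence into complete codons, dropping any remainder."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def gc_by_position(cds):
    """Return (GC1, GC2, GC3) as fractions over all unambiguous codons."""
    codons = [c for c in split_codons(cds) if c in CODON_TABLE]
    return tuple(
        sum(c[pos] in "GC" for c in codons) / len(codons) for pos in range(3)
    )


def rscu(cds):
    """Return RSCU per sense codon.

    RSCU = observed count / (family total / number of synonymous codons).
    Amino acids absent from the sequence are omitted from the result.
    """
    counts = Counter(
        c for c in split_codons(cds) if CODON_TABLE.get(c, "*") != "*"
    )
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values
```

Under this definition, a codon with RSCU > 1 is used more often than expected under equal synonymous usage, which is the criterion for biased codons applied in this study. Regressing GC12 (the mean of GC1 and GC2) on GC3 across genes gives the neutrality plot.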

Conclusions

In general, the genome characteristics and codon usage patterns of the chloroplast genomes were similar but not identical among Annonaceae species. The chloroplast genome sizes of the 14 species ranged from 159,031 to 201,906 bp. All genomes had a quadripartite structure comprising the LSC, SSC, IRa, and IRb regions, and the size of each region varied greatly among species. Genome size, total number of genes, and number of CDS were correlated to some extent. The number of SSRs ranged from 63 to 101, the most abundant repeat type was mono-nucleotide, and the base composition favored A/T. As in most species, forward repeats and palindromic repeats were the most abundant repeat types. The IR/SC boundaries of the chloroplast genomes of the 14 species showed obvious contraction and expansion. The GC content was below 50% in all 14 species. The codon-related parameters (ENC, CAI, CBI, FOC, and L_aa) of the 14 species varied significantly. These species shared 21 biased codons (RSCU > 1), 8 HF codons, and 5 optimal codons. In the ENC-plot and neutrality plot analyses, the CUB of all species except A. muricata and A. reticulata was largely influenced by natural selection. The PR2-plot showed that G was more frequent than C and that T was more frequent than A. The correspondence analysis showed that codon usage patterns differed among Annonaceae species.