Introduction

Codons, as the link between deoxyribonucleic acids and proteins in organisms, play an essential role in transmitting genetic information (Liu et al. 2020). Codon usage bias (CUB) refers to the non-random use of synonymous codons to encode the amino acids of a protein. Prior studies have shown that CUB is universal and has been confirmed in many organisms, such as bacteria, fungi, animals, and plants. CUB differs not only among species but also among the compartments of a cell, such as the nucleus, chloroplast (cp.), and mitochondrion, and even among genes of the same genome (Liu et al. 2017). With the swift progress of high-throughput sequencing techniques in recent years, the genomes of many species have been sequenced, helping to understand CUB at the genome-wide level. The CUB in gene families and whole genomes has been intensively studied in many model and non-model organisms, such as Arabidopsis thaliana (Chiapello et al. 1998), Nicotiana tabacum (Anwar et al. 2021), phytoplankton (Krasovec and Filatov 2022), and Gnetum luofuense (Deng et al. 2021b). Through CUB analysis, the genetic and mutational events influencing genes and whole genomes can be identified, and the regulatory mechanisms underlying gene expression can be further revealed (Kumar et al. 2004; Shah and Gilchrist 2011; Zhou et al. 2014). Moreover, closely related species tend to share a similar CUB, which can supply substantial evidence for identifying new germplasm resources and help illustrate the evolutionary relationships among species (Ma et al. 2015). Furthermore, studies on CUB can be applied to predict the optimal heterologous expression receptor plant for target genes, as the transcription process (Zhou et al. 2016), the translation efficiency of proteins (Frumkin et al. 2018), and RNA toxicity (Mittal et al. 2018) can all be affected by CUB. Such studies are of great significance for constructing expression vectors and investigating genes of unknown function (Zelasko et al. 2013; Quax et al. 2015).

The CUB was also found to differ in the cp. genomes of different species (Liu and Xue 2005). To increase the adaptability of cp. genetic transformation and the expression level of target nucleic acids, it was reasonable to study the CUB in the whole cp. genome. Researchers have also studied the CUB in cp. genomes in a few species, such as Calligonum mongolicum (Duan et al. 2020), Panicum species (Li et al. 2021), Cyperus alternifolius, Thalia dealbata, and Canna indica (Deng et al. 2021a). Elaeagnus plants were used as hedge plants in urban areas for their decorative aspects (aromatic flowers and glistening leaves), dryness resistance, adaptability to various soil and water environments, contamination prevention, and ability to attract insects and birds. However, in recent years, their fruits have been found to be nutrient-rich, and other parts have been recognized for their medicinal values (Patel 2015). Consequently, in addition to being edible, parts of the plant were utilized in traditional medicine as heat-clearing, muscle relaxing, analgesic, anti-inflammatory, astringent, and antifungal agents (Bendaikha et al. 2014; Saboonchian et al. 2014). Even though researchers have sequenced the complete cp. genomes of various Elaeagnus species (Choi et al. 2015; Wang et al. 2017; Liu et al. 2019; Lu et al. 2022), CUB studies on these cp. genomes have not yet been published.

In this work, the CUB in the cp. genomes of nine Elaeagnus species, namely Elaeagnus angustifolia, Elaeagnus glabra, Elaeagnus henryi, Elaeagnus loureirii, Elaeagnus macrophylla, Elaeagnus mollis, Elaeagnus multiflora, Elaeagnus pungens, and Elaeagnus umbellata, was analyzed (Fig. 1). The base composition and optimal codons of each species were compared. Following previous studies, a correlation analysis, neutrality plot, effective number of codons (Nc) plot, parity rule 2 (PR2) plot, and t-distributed Stochastic Neighbor Embedding (tSNE) dimensionality reduction clustering were performed to investigate their CUB. Furthermore, to infer their evolutionary relationships and predict the best heterologous expression receptor plant, the relative synonymous codon usage (RSCU) values of the nine species were calculated and subjected to cluster analysis, which will provide theoretical support for subsequent cp. genomic research on the Elaeagnus species.

Fig. 1

Figures of the nine Elaeagnus species. A Elaeagnus angustifolia. B Elaeagnus glabra. C Elaeagnus henryi. D Elaeagnus loureirii. E Elaeagnus macrophylla. F Elaeagnus mollis. G Elaeagnus multiflora. H Elaeagnus pungens. I Elaeagnus umbellata

Materials and methods

Flowchart

The flowchart of the materials and methods in this work can be seen in Fig. 2.

Fig. 2

The flowchart of the materials and methods used in this work

Sequence data

The complete cp. genomes of the nine Elaeagnus species were downloaded from the GenBank database of the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/). Only coding sequences (CDSs) with a length of at least 300 bp were used for follow-up studies. Each CDS was also required to have ATG as its initiator codon and TAA, TAG, or TGA as its terminator codon. Additionally, repeated sequences were discarded using a Perl script (Table 1).
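The filtering criteria above can be illustrated with a minimal Python sketch. It is not the Perl script used in this work; the function name `filter_cds`, the `(gene_name, sequence)` input format, and the treatment of repeated sequences as exact duplicates are assumptions made for illustration only.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def filter_cds(records, min_len=300):
    """Filter (gene_name, sequence) pairs with the criteria described in the text.

    Keeps a CDS only if it is at least `min_len` bp long, starts with ATG,
    and ends with TAA, TAG, or TGA. Repeated sequences are dropped; here a
    repeat is ASSUMED to mean an exact duplicate of an earlier sequence.
    """
    seen = set()
    kept = []
    for name, seq in records:
        seq = seq.upper()
        if len(seq) < min_len:
            continue  # too short
        if not seq.startswith(START) or seq[-3:] not in STOPS:
            continue  # missing initiator or terminator codon
        if seq in seen:
            continue  # duplicated copy of a sequence already kept
        seen.add(seq)
        kept.append((name, seq))
    return kept
```

With this sketch, a gene present in two identical copies is retained once, and short or incomplete CDSs are skipped before any CUB parameter is calculated.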

Table 1 Information of the chloroplast genomes of nine Elaeagnus species

Statistical analysis of parameters related to CUB

Initially, the raw data were obtained using the codonW 1.4.2 software (http://codonw.sourceforge.net/). The mean values and corresponding variation ranges of 15 parameters, including the aromaticity score (Aromo: the frequency of aromatic amino acids), codon adaptation index (CAI), codon bias index (CBI), general average hydropathicity (GRAVY), frequency of optimal codons (Fop), number of amino acids (L_aa), number of synonymous codons (L_sym), and effective number of codons (Nc), were then calculated for the nine Elaeagnus species. Furthermore, the GC, GC3s, GC12, T3s, C3s, A3s, and G3s values (average base contents of the codons) of each cp. genome were calculated in Excel 2019 and plotted as a bar chart in SPSS v26. A Pearson correlation analysis of the parameters above was also performed on the Sangerbox 3.0 cloud platform (Shen et al. 2022), taking the results for E. mollis as a representative.
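To make the position-specific base contents concrete, the following Python sketch computes GC12, GC3s, and GC for a single CDS. It is an illustration rather than the codonW or Excel procedure used here; the function name `codon_gc_profile` is hypothetical, and it is assumed that GC3s is taken over synonymous codons only (ATG, TGG, and stop codons excluded) while GC12 and GC are taken over all sense codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
SINGLE_CODON_AA = {"ATG", "TGG"}  # Met and Trp have no synonymous codons

def codon_gc_profile(cds):
    """Return GC12, GC3s, and GC (as fractions) for one in-frame CDS."""
    seq = cds.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    sense = [c for c in codons if c not in STOP_CODONS]
    synonymous = [c for c in sense if c not in SINGLE_CODON_AA]

    gc1 = sum(c[0] in "GC" for c in sense) / len(sense)
    gc2 = sum(c[1] in "GC" for c in sense) / len(sense)
    # ASSUMPTION: GC3s counts the 3rd position of synonymous codons only
    gc3s = sum(c[2] in "GC" for c in synonymous) / len(synonymous)
    gc = sum(base in "GC" for c in sense for base in c) / (3 * len(sense))
    return {"GC12": (gc1 + gc2) / 2, "GC3s": gc3s, "GC": gc}
```

Applying the function to every filtered CDS yields the per-gene values that can then be averaged per genome or passed to a correlation analysis.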

Optimal codons analysis

In general, the synonymous codons used most frequently among the 64 codons are known as optimal codons (Ikemura 1985). The Nc value theoretically ranges from 20 to 61 (inclusive). The larger the Nc value, the broader the range of synonymous codons used and the weaker the CUB (and vice versa) (Wright 1990). The expected value of the Nc was computed as:

$${\text{ENc}} = 2 + GC3s + 29/[GC3s^{2} + (1 - GC3s)^{2} ].$$
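As a worked illustration of the formula above, the expected Nc can be evaluated over the full range of GC3s to obtain the standard curve used in an Nc plot. This is a minimal sketch; the function name `expected_nc` and the 0.01 step size are illustrative choices.

```python
def expected_nc(gc3s):
    """Expected Nc for a given GC3s (fraction between 0 and 1)."""
    return 2 + gc3s + 29 / (gc3s ** 2 + (1 - gc3s) ** 2)

# Points of the standard curve, sampled every 0.01 in GC3s
curve = [(step / 100, expected_nc(step / 100)) for step in range(101)]
```

For example, GC3s = 0.5 gives 2 + 0.5 + 29/0.5 = 60.5, the maximum of the curve, whereas GC3s = 0 gives 31.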

The filtered CDSs were first ordered by their Nc values. Subsequently, 10% of the genes (approximately five CDSs for each species) were picked from each end of the ranking as the high (small Nc values) and low (large Nc values) expression groups. Their corresponding RSCU values were calculated using the CUSP tool (https://www.bioinformatics.nl/cgi-bin/emboss/cusp) in the European Molecular Biology Open Software Suite (EMBOSS) online software. The RSCU value of a codon was computed as:

$$\text{RSCU=}\frac{{\text{X}}_{\text{ij}}}{{\sum}_{\text{j=1}}^{{\text{n}}_{\text{i}}}{\text{X}}_{\text{ij}}}\cdot{\text{n}}_{\text{i}}$$

where \({\text{X}}_{\text{ij}}\) stands for the observed frequency of the jth codon for the ith amino acid, and \({\text{n}}_{\text{i}}\) is the number of codons encoding the ith amino acid (Chakraborty et al. 2020; Li et al. 2021). The theoretical range of RSCU is 0–6. Finally, the codons that met the criteria (RSCU > 1 and ∆RSCU > 0.08) were regarded as the optimal codons (Table S1) (Zhang et al. 2007), and the results were presented as an UpSet plot using the Sangerbox 3.0 cloud platform.
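The RSCU formula and the optimal codon criteria can be sketched in Python as follows. This is not the CUSP implementation; the function names `rscu` and `optimal_codons` are hypothetical, and it is assumed that the RSCU > 1 criterion is evaluated over all filtered CDSs and that ∆RSCU is the RSCU of the high expression group minus that of the low expression group.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds_list):
    """RSCU = X_ij / sum_j(X_ij) * n_i for every sense codon."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    values = {}
    for aa in set(AMINO) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        for c in family:
            # An amino acid that never occurs gets RSCU 0 for all its codons
            values[c] = counts[c] * len(family) / total if total else 0.0
    return values

def optimal_codons(all_cds, high_cds, low_cds, min_rscu=1.0, min_delta=0.08):
    """Codons with RSCU > min_rscu and delta RSCU (high - low) > min_delta.

    ASSUMPTION: RSCU is taken over all CDSs; delta RSCU = high - low group.
    """
    overall, high, low = rscu(all_cds), rscu(high_cds), rscu(low_cds)
    return sorted(c for c in overall
                  if overall[c] > min_rscu and high[c] - low[c] > min_delta)
```

For a two-codon amino acid such as phenylalanine, two TTT codons and one TTC codon give RSCU values of 4/3 and 2/3, respectively, so only TTT exceeds the RSCU > 1 threshold.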

Analysis of the CUB influencing factors

The dominant factor influencing the CUB in the cp. genes was estimated through three graphical methods, namely the neutrality plot, Nc plot, and PR2 plot, prepared in SPSS v26 and Adobe Illustrator CC 2018 (AI 2018). The neutrality plot is a simple linear regression of GC12 on GC3s. The Nc plot compares the actual Nc values with the standard curve of expected Nc values. The PR2 plot reveals the imbalance in A/T and G/C base usage at the 3rd position of codons.
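The neutrality plot regression and the PR2 coordinates can be sketched in Python as below. This is an illustration and not the SPSS workflow used in this work; the function names `linear_fit` and `pr2_point` are hypothetical, and the PR2 sketch assumes that all 3rd codon positions of a CDS are counted.

```python
def linear_fit(x, y):
    """Least-squares fit of y on x; returns (slope, intercept, R squared).

    For the neutrality plot, x holds the GC3s values and y the GC12 values
    of the individual genes.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy) if syy else 0.0
    return slope, intercept, r_squared

def pr2_point(cds):
    """PR2 coordinates (G3/(G3 + C3), A3/(A3 + T3)) of one in-frame CDS.

    ASSUMPTION: every 3rd codon position is counted, without restricting
    the count to particular codon families.
    """
    seq = cds.upper()
    third = [seq[i + 2] for i in range(0, len(seq) - len(seq) % 3, 3)]
    a, t, g, c = (third.count(base) for base in "ATGC")
    return g / (g + c), a / (a + t)
```

A slope close to zero together with a very small R squared corresponds to the case in which GC12 does not follow GC3s, and a gene whose PR2 point is (0.5, 0.5) lies at the center of the plot.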

Evolutionary relationship analysis

Clustering analysis of the nine Elaeagnus species was initially performed based on the RSCU values via the complete linkage method and Euclidean distance using Sangerbox 3.0. Furthermore, tSNE dimensionality reduction clustering analysis was carried out on the Sangerbox 3.0 platform for the nine Elaeagnus species, which were sorted into Sect. Deciduae and Sect. Sempervirentes according to the physiological and ecological characteristics of their leaves in adapting to the environment [first proposed by Servettaz (1909) and accepted by Chang (1983)]. In addition, a phylogenetic tree was constructed using the maximum likelihood (ML) method in the FastTree v2.1.11 software (http://www.microbesonline.org/fasttree/) based on the filtered CDSs, which had been aligned beforehand using the Multiple Alignment using Fast Fourier Transform (MAFFT) v7.480 program (https://mafft.cbrc.jp/alignment/software/windows.html).
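The complete linkage clustering with Euclidean distance described above can be sketched in plain Python. This is not the Sangerbox implementation; the function names and the input format (a dictionary mapping each species to its RSCU values in a fixed codon order) are assumptions, and the species labels in the usage note are placeholders rather than data from this study.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equally long RSCU profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering; merges stop when n_clusters groups remain.

    The distance between two clusters is the LARGEST pairwise distance
    between their members (complete linkage).
    """
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = max(euclidean(profiles[a], profiles[b])
                           for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

For instance, four placeholder profiles `{"sp1": [1.0, 1.0], "sp2": [1.1, 0.9], "sp3": [2.0, 0.0], "sp4": [1.9, 0.1]}` cut into two groups yield sp1 with sp2 and sp3 with sp4, since species with similar RSCU profiles are merged first.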

Prediction of optimal heterologous expression receptor plant

Clustering analysis of the representative Elaeagnus species (E. mollis) together with model organisms, namely the whole genomes of Bacillus subtilis, Escherichia coli, Microcystis aeruginosa, Saccharomyces cerevisiae, and Staphylococcus aureus and the cp. genomes of A. thaliana, Nicotiana sylvestris, Triticum aestivum, and Oryza sativa, was carried out on the Sangerbox 3.0 cloud platform based on RSCU values that were either acquired from the codon usage database (http://www.kazusa.or.jp/codon/) or calculated as described above (Table S2).

Results and discussion

Statistical analysis of parameters related to the CUB

According to the mean values and corresponding variation ranges of the parameters closely associated with CUB, most cp. CDSs were prone to mutation in the Elaeagnus species. Meanwhile, the rps7 genes of the nine cp. genomes were identical in sequence (Table S3). The cp. genomes of the nine Elaeagnus species possessed duplicated genes, such as ndhB, rpl2, rps7, ycf1, and ycf2, as well as genes that were not present in all nine sequence files or were shorter than 300 bp, such as clpP, ndhD, ndhE, petB, rpl16, and rps12, which was consistent with former studies on the cp. genomes of the Panicum species (Li et al. 2021), C. alternifolius, T. dealbata, and C. indica (Deng et al. 2021a). It was also found that the longer a single-copy gene was, as in ndhF, rpoC1, rpoB, psaA, and psaB, the more stable it was according to the variation ranges of Nc, which was also similar to previous research on the Panicum species (Table S3) (Li et al. 2021). This may indicate that the long single-copy genes played an important role in the plant cells and were stable in sequence.

Most of the 20 amino acids are encoded by synonymous codons whose 1st and 2nd bases are identical. Therefore, CUB is reflected in the base composition at the 3rd position of codons in many cases (Zhao et al. 2019). The codons of the CDSs in the cp. genomes of the nine Elaeagnus species were found to favor ending with an A/T base. The mean G/C base contents of the 1st and 2nd positions were notably higher than that of the 3rd position, consistent with former studies on the CUB in the cp. genomes of plants, such as the Lespedeza species (Somaratne et al. 2019), Camellia species (Yengkhom et al. 2019), and C. mongolicum (Duan et al. 2020). Nevertheless, the G3s values of the cp. genes in the Elaeagnus species appear to be higher than those of the species above, resulting in higher GC3s and GC values as well (Fig. 3A). This may lead to higher stability of the DNA structures in the cp. genomes of Elaeagnus species, as there are three hydrogen bonds between G and C bases.

Fig. 3

Codon usage pattern of the nine Elaeagnus species. A Codon base composition of the nine Elaeagnus species. A3s/T3s/C3s/G3s, the frequency that codons have an A/T/C/G at their 3rd position of synonymous codons; GC3s, the G/C contents of the 3rd position of codons; GC12, the average G/C contents of the 1st and 2nd positions of codons; GC, the average G/C contents of the three positions of codons. B Pearson correlation analysis of parameters related to CUB towards the cp. genome of Elaeagnus mollis. CAI, codon adaptation index; CBI, codon bias index; Fop, frequency of optimal codons; Nc, effective number of codons; L_sym, number of synonymous codons; L_aa, number of amino acids; Gravy, general average hydropathicity; Aromo, aromaticity score (the frequency of aromatic amino acids); *Significant at p < 0.05 (two-tailed); **Significant at p < 0.01 (two-tailed); ***Significant at p < 0.001 (two-tailed); ****Significant at p < 0.0001 (two-tailed)

According to the Pearson correlation analysis, the CAI had a significantly positive correlation with the CBI (p < 0.0001, r = 0.74) and Fop (p < 0.0001, r = 0.77), indicating an association between CUB and gene expression (Fig. 3B). The more optimal codons were used, the higher the gene expression level would be. Similar results were also observed in the cp. genome of Hemiptelea davidii (Liu et al. 2020). Moreover, the CAI was positively correlated with T3s (p < 0.0001, r = 0.52) and C3s (p < 0.01, r = 0.34) and negatively correlated with A3s (p < 0.001, r = −0.44) and G3s (p < 0.05, r = −0.31), indicating that cp. genes in E. mollis whose codons end with a T/C base generally have a higher expression level. There was a significantly positive correlation between the CBI and C3s (p < 0.001, r = 0.44) and negative correlations between the CBI and G3s/(G3s + C3s) (p < 0.0001, r = −0.58) and G3s (p < 0.001, r = −0.43), indicating that the codons of cp. genes in E. mollis prefer C-termination rather than G-termination. In addition, the Nc had a significantly positive correlation with GC3s (p < 0.001, r = 0.47) and G3s (p < 0.001, r = 0.47), indicating that the more codons ending with G/C bases were used, the weaker the CUB was in the cp. genome of E. mollis. The correlation between Nc and GC3s was also seen in studies on the cp. genomes of Porphyra umbilicalis (Li et al. 2019) and Mesona chinensis (Tang et al. 2021).

The GRAVY had a significantly negative correlation with A3s/(A3s + T3s) (p < 0.01, r = −0.41) and A3s (p < 0.05, r = −0.32) and a significantly positive correlation with T3s (p < 0.05, r = 0.37), demonstrating that genes whose translated proteins were relatively hydrophobic usually contained more T-terminated codons in the cp. genome of E. mollis. The Aromo had a significantly positive correlation with T3s (p < 0.0001, r = 0.53) and a significantly negative correlation with A3s/(A3s + T3s) (p < 0.001, r = −0.46) and GC (p < 0.01, r = −0.41), which indicates that TTT codons were more frequently used for phenylalanine and TAT codons were more frequently used for tyrosine in the cp. genes of E. mollis. Additionally, former studies have demonstrated that Axis 1 (the leading factor) obtained via correspondence analysis (COA) of the codon usage pattern of photosynthesis-associated genes was significantly correlated with the GRAVY, Aromo, and sequence length, whereas no such correlation was found for genetic system-related genes, supporting that the photosynthesis-associated genes play a pivotal role in the cp. genomes (Zhang et al. 2018).

Optimal codons analysis

The RSCU values of the codons in the cp. genes of E. mollis were similar to those reported in a previous study on cp. genomes (Cheng et al. 2020), which lends credibility to the RSCU values calculated in this study. According to the UpSet graph, the numbers of optimal codons for the cp. genomes of the nine Elaeagnus species were no less than 15 and no more than 19 (Fig. 4). Ten optimal codons, ATT, GAA, CGA, GTT, AAA, GTA, AGT, CGT, TTA, and GGT, were shared by all nine Elaeagnus species, which signifies that these species all prefer using them. Additionally, the optimal codons for the cp. genomes of the nine species were comparatively similar, albeit with minor differences. Previous studies have suggested that the optimal codons for the cp. genome of Populus alba were CGT, GTC, TCT, and TTA, while those for A. thaliana were AAC, CGT, GGT, GTT, TAC, and TCA. The cp. genes of T. aestivum preferred to use AAA, ACT, CCT, CGT, GAG, GGT, TAC, and TCT, while the cp. genes of Cycas taitungensis favored AAT, CAT, CCA, GGT, GTA, TAT, TCA, TTA, and TTT as their codons (Zhou et al. 2008). Compared with the species mentioned above, the numbers of optimal codons for the cp. genomes of Zea mays (10), Pinus koraiensis (12) (Zhou et al. 2008), Gynostemma species (8–12) (Zhang et al. 2021), and the Euphorbiaceae species (17–18) (Wang et al. 2020) were more similar to that of the Elaeagnus species (14–19). The large numbers of optimal codons may reflect their stronger CUB.

Fig. 4

Optimal codons of the nine Elaeagnus species. The UpSet graph shows the optimal codon sets of every species. The left part displays the total number of optimal codons of each species listed on the right. A solid black dot in the middle part means that the species on the right has the codons above as its optimal codons

Analysis of the CUB influencing factors

Many factors influence CUB, such as the length of gene sequences (Marais and Duret 2001; Stoletzki 2011; Ribeiro et al. 2012), the codon position, protein translation efficiency (Haupt et al. 2009; Li et al. 2017), tRNA abundance (Buchan et al. 2006), gene mutations, and natural selection (Nandy 2002; Suzuki 2010). Nevertheless, researchers have discovered that base mutation and natural selection were the leading factors affecting CUB among different species (Fedorov et al. 2002; Hiraoka et al. 2009). In this work, E. mollis was used as an example to reveal the dominant factors influencing the CUB in the Elaeagnus species.

If GC12 was significantly correlated with GC3s in the neutrality plot, this would suggest that there was no difference in base usage between the 1st, 2nd, and 3rd positions of codons, and mutation pressure was the leading factor influencing CUB. In contrast, if the correlation between GC12 and GC3s was not significant and the slope of the fitted line approached zero, this would suggest that base usage differed between the 1st and 2nd positions and the 3rd position, and that the CUB was significantly correlated with the 3rd bases and strongly influenced by natural selection (Sueoka 1988; Liu and Xue 2004). In the neutrality plot based on E. mollis, the correlation between GC12 and GC3s was insignificant (R² = 1.764 × 10⁻⁴). The slope (k = 0.02) was very close to zero, demonstrating that base usage differed significantly between the 1st and 2nd positions and the 3rd position of its cp. codons. Natural selection therefore strongly affected the CUB in the cp. genome (Fig. 5). Compared with the cp. genes of Guizotia abyssinica (R² = 0.0282, k = 0.225) and Helianthus annuus (R² = 0.0293, k = 0.2388) (Nie et al. 2014), the R² values and slopes of the neutrality plots based on the Elaeagnus species were evidently smaller (Fig. 5), suggesting that their CUB was more influenced by selection pressure than that of the two species above, with a strong preference. By contrast, the R² values and slopes of the Elaeagnus species were closer to those of Punica granatum (R² = 0.0036, k = 0.1165) (Yan et al. 2019), Triticum aestivum (R² = 0.0105, k = 0.1222) (Zhang et al. 2007), Ageratina adenophora (R² = 0.008, k = 0.1148), and Jacobaea vulgaris (R² = 0.0057, k = 0.0809) (Nie et al. 2014), indicating that their CUB strengths are close to one another. Among them, the GC12 of the cp. genes in E. henryi was negatively correlated with GC3s (R² = 6.381 × 10⁻⁵, k = −0.01), which is similar to that of Lactuca sativa (R² = 7 × 10⁻⁶, k = −0.0036) (Nie et al. 2014), suggesting that the CUB in these two species was even more strongly influenced by natural selection. Their traits controlled by cp. genes have been subjected to selection pressure for a period of time.

Fig. 5

Neutrality plot of the cp. genes (≥ 300 bp) in the nine Elaeagnus species. In the neutrality plot, a significant correlation between GC12 and GC3s suggests that there was no difference in base usage between the 1st, 2nd, and 3rd positions of codons and that mutation pressure was the leading factor influencing codon usage bias. Otherwise, base usage differed between the 1st and 2nd positions and the 3rd position, and the codon usage bias was significantly correlated with the 3rd bases and strongly influenced by natural selection

If the data points representing genes were mostly located above the expected value curve in the Nc plot, their CUB was primarily influenced by gene mutation. If the data points were mostly distributed far below the standard curve, selection pressure was the leading factor affecting the CUB in these genes (Wright 1990; Jia et al. 2009; Pan et al. 2009). According to the Nc plot based on the cp. genome of E. mollis, the CUB in rpl2 (ribosomal protein gene), located at the top of the graph, was profoundly influenced by base mutation, with a weak preference (Fig. 6). Natural selection was the dominant factor affecting the CUB in the genes with a strong preference, such as rps12 and psbA (PSII-A core protein of photosystem II), which were distributed far below the curve. In addition, the different cp. genes of E. mollis differed significantly in CUB. Compared with H. annuus (Chen et al. 2021) and the Platycarya species (Wang et al. 2021), the Elaeagnus species had more points above the curve in the Nc plot, demonstrating that the CUB in their cp. genomes was considerably less influenced by selection pressure, with weaker preferences. Their cp. genes showed higher diversity, and the environment can accommodate this difference.

Fig. 6

Nc plot of the cp. genes (≥ 300 bp) in the nine Elaeagnus species. Provided that the data points representing genes were more abundantly located above the expected value curve in the Nc plot, the codon usage bias was primarily influenced by the gene mutation. While the data points were more distributed far below the standard curve, selection pressure was the leading factor affecting the codon usage bias of these genes

Similarly, PR2 plot analysis is one of the methods used to determine the influence of mutation pressure and natural selection on the CUB in genes (Sueoka 2001). In a cruciform scatter graph with G3s/(G3s + C3s) as its abscissa and A3s/(A3s + T3s) as its ordinate, most of the points representing genes did not lie near the center point, indicating that factors other than genetic mutation, such as natural selection, influenced the CUB in genes (Chakraborty et al. 2020; Tang et al. 2021). For genes close to the center point, such as rbcL (ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit) and rpl22 among the cp. genes of E. mollis, the CUB was primarily influenced by genetic mutation, with a weak preference. For the points located in the four corners of the plot, such as rps12 (also found to have a strong bias in the Nc plot), atpF, atpI (ATP synthase), and psaA (PSI-A core protein of photosystem I) among the cp. genes of E. mollis, factors other than base mutation, such as selection pressure, influenced their CUB, and the preference was strong (Fig. 7). Likewise, few genes lay close to the center point in the PR2 plots of any Elaeagnus species, indicating that natural selection pressure played an essential role in shaping the CUB of their cp. genes. The PR2 plot outcomes of the Elaeagnus species were similar to those of six Euphorbiaceae species (Wang et al. 2020). In comparison, the points representing the cp. genes of these species were distributed farther from the center point than those of Coffea arabica in the PR2 plots (Nair et al. 2012), revealing that the CUB in the cp. genomes of the Elaeagnus and Euphorbiaceae species was more affected by natural selection than that of C. arabica and suggesting that their codon usage had stronger preferences.

Fig. 7

PR2 plot of the cp. genes (≥ 300 bp) in the nine Elaeagnus species. Most of the points representing genes did not approach the center point, indicating that there were some other factors influencing the codon usage bias of genes aside from genetic mutation, such as natural selection

Evolutionary relationship analysis

The RSCU values of different codons in distinct species reflect their evolutionary relationship to some extent, as codons link gene sequences with polypeptide sequences. Additionally, they serve as supporting information for improving the taxonomic study of the Elaeagnus species (Li et al. 2019). Furthermore, the cp. genome is, to some extent, maternally inherited and is therefore more suitable and more convenient for phylogenetic analysis than the whole genome.

Following the clustering analysis based on the RSCU values, the nine Elaeagnus species were classified into five categories (Fig. 8A). Species in C0, C1, and C3 were all plants from Sect. Deciduae, while species in C2 and C4 were all from Sect. Sempervirentes. Moreover, the tSNE dimensionality reduction clustering analysis was also conducted on the nine Elaeagnus species, grouped by evergreen and deciduous ecological characters, based on their respective RSCU values. The trendlines of these two sections were wholly separated (95% confidence interval), indicating that the RSCU values of the cp. genomes from these two sections differed from each other to a degree (Fig. 8B). In addition, future researchers may be able to distinguish the plants of these two groups entirely based on the RSCU, which may help in the quick identification of new genetic resources belonging to Elaeagnus.

Fig. 8

Evolutionary relationship analysis of the nine Elaeagnus species. A The clustering analysis of the nine Elaeagnus species based on the RSCU values. B The tSNE dimensionality reduction clustering analysis of the nine Elaeagnus species classified by sections based on the RSCU values. C The phylogenetic tree of the nine Elaeagnus species based on CDSs (≥ 300 bp), shown with their fruits. The clustering analysis was performed via the complete linkage method and Euclidean distance. The closer the species are, the more similar their codon usage biases are

The Sect. Sempervirentes species are evergreen erect or climbing shrubs with early flower opening and fruit ripening, while the Sect. Deciduae species are deciduous or semi-evergreen upright shrubs or trees with late flower opening and fruit ripening (Servettaz 1909). It was also suggested that Sect. Deciduae should be renamed Sect. Elaeagnus (Sun and Lin 2010). In a former study, 15 Elaeagnus species were clustered by ML analysis into three branches that differed in the above biological and ecological traits based on the matK sequences, which were significantly stable in their cp. genomes, in agreement with the traditional taxonomic classification. In comparison, the Elaeagnus species could not be grouped clearly by morphological clustering via principal component analysis (PCA). Additionally, the phylogenetic trees based on ITS (nrDNA) sequences also differed significantly from those based on matK genes via the ML and maximum parsimony (MP) methods, and the ITS genes had biparental inheritance while the matK genes had matrilineal inheritance (Cheng et al. 2022). In this study, an identical outcome was not obtained based on the whole cp. genomes via the same ML method, which may result from rapid changes in the CUB of some cp. genes in the Elaeagnus species affected by their local environment. Moreover, the fruit of E. mollis was significantly different from that of the other species since it was the only one with eight ridges, and the species was also found to be distant from the others in the clustering analysis derived from the RSCU values (Fig. 8A, C).

Additionally, the CUB in the cp. genomes of the nine Elaeagnus species was highly similar, as the RSCU values of their codons were found to be very close (Fig. 8A). Combining the clustering heatmap and the tSNE dimensionality reduction clustering based on the RSCU values with the phylogenetic tree generated from CDSs, it was found that the species clustered into one group in the clustering heatmap were also relatively close to one another in the other two clustering results, except for E. henryi (Fig. 8). In addition, the tSNE dimensionality reduction clustering was more similar to the ML-based phylogenetic tree, suggesting that the tSNE clustering method based on RSCU was more appropriate for evolutionary relationship analysis of the Elaeagnus species than the complete linkage method.

Prediction of optimal heterologous expression receptor plant

In genetic engineering research, such as exogenous gene expression, molecular breeding, and functional verification, the degree of matching between foreign genes and receptor genomes is essential for successfully obtaining transgenic materials. Considerable differences between their CUB will probably create methylation hotspots, resulting in the silencing or diminished expression of exogenous genes (Perlak et al. 1990). In addition to influencing the translation speed and folding of proteins, CUB can also affect transcriptional regulation at the mRNA level (Chen et al. 2017) and the expression of exogenous genes (Zhou et al. 2018). Based on the clustering heatmap, the RSCU values of the codons in the cp. genome of E. mollis were closest to those of A. thaliana, indicating that a higher expression level would be attained by adopting the cp. genome of A. thaliana as the heterologous expression receptor for the cp. genes of E. mollis (Fig. 9). In comparison, based on previous studies, the cp. genomes of A. thaliana, Populus trichocarpa, and S. cerevisiae can be regarded as compatible heterologous expression receptors for the cp. genes of the Miscanthus species (Sheng et al. 2021) and Euphorbiaceae (Wang et al. 2020). Since these three groups had similar optimal heterologous expression receptors, it may also suggest that they have a closer kinship in a sense. Notably, the whole genome of M. aeruginosa (Cyanophyta) was clustered more closely to the cp. genomes of plants (C3 and C4), thus supporting the endosymbiont hypothesis on the origin of the cp. to some extent.

Fig. 9

The clustering analysis compared with model organisms based on the RSCU values. The clustering analysis was performed via the complete linkage method and Euclidean distance. The closer the species are, the more similar their codon usage biases are

Conclusion

Most of the cp. genes in the nine Elaeagnus species were prone to mutation, while the rps7 gene sequences were identical across all nine species. Selection pressure impacted CUB more significantly than gene mutation. Furthermore, the CUB in the cp. genes of the nine Elaeagnus species was extremely strong but showed observable diversity among different genes. The nine Elaeagnus species preferred using ten codons: ATT, GAA, CGA, GTT, AAA, GTA, AGT, CGT, TTA, and GGT. The rps12 gene in the cp. genome of E. mollis had extraordinarily strong CUB according to both the Nc and PR2 plot methods. Clustering outcomes based on the RSCU and cp. gene sequences were generally consistent, and both could reveal the evolutionary relationship to a degree. In this work, it was suggested that the cp. genome of A. thaliana should be selected as the optimal heterologous expression receptor plant to obtain a higher expression efficiency in subsequent research on the cp. genes of the Elaeagnus species. Nevertheless, particular genes require further analysis due to the apparent differences in CUB among the cp. genes of the Elaeagnus species.