Introduction

Malus is an economically important genus distributed mainly in the North Temperate Zone (Amandine et al. 2014). About twenty sub-species are endemic to China (Yan et al. 2019; Naizaier et al. 2019; Li et al. 2020). Malus species are well known for their edible value (Zhang et al. 2018) and for the medicinal functions of the compounds they contain. They are also of great interest to scientists for their excellent horticultural traits and ornamental value (Bao et al. 2016; Xun et al. 2021). Because they are easy to cultivate, widely cultivated Malus species have become increasingly common (Svetlana et al. 2013). These traits make Malus species a useful set of model plants for study. The chloroplast is an important organelle, and its genome has a quadripartite organization that has been widely used in studying evolutionary traits. In addition, chloroplasts are maternally inherited organelles in all green plants (Liang et al. 2019). Studies have shown that most families of photosynthetic eukaryotes emerged through the takeover of a free-living photosynthetic eukaryote by the host (Supriyo et al. 2020). Many factors have been shown to affect the codon usage pattern, such as mutation pressure (Sharp et al. 1988), gene length (Tao et al. 2009), natural selection pressure (Shackelton et al. 2006) and tRNA abundance, among others (Pandey et al. 2020). Codon usage measures of the chloroplast genomes, including the effective number of codons (ENC), the parity rule 2 (PR2) plot of G3 vs. A3, the relative synonymous codon usage (RSCU) and the frequency of optimal codons (Fop), may be useful for exploring their molecular evolution characteristics and for predicting the expression level of a given chloroplast gene (Mazumder et al. 2020; Yan et al. 2022).

The importance of codon usage patterns in chloroplast genomes has been emphasized by many studies (He et al. 2016; Challabathula et al. 2018; Gichira et al. 2019). Extensive studies have revealed that variation in chloroplast genomes arises from varying degrees of mutation pressure, selection pressure (Yang et al. 2018; Haruo et al. 2016; Kong et al. 2017; Xu et al. 2011) or cultivation pressure from humans. Comparative analyses have been used to examine the codon usage pattern of genomes within or between two groups of plants (Liu et al. 2020b). The genetic diversity of Malus species has been studied using microsatellite markers to assess their evolutionary range. However, most previous studies used only about ten samples and did not consider the evolutionary features of specific genes in chloroplast genomes (Li et al. 2016). A Malus chloroplast genome contains about 80 coding genes. All genes in the chloroplast genome are believed to use the universal genetic code (Nakamura et al. 2007), and a wide diversity may exist in Malus species (Mazumdar et al. 2017). Knowledge of the codon usage patterns of chloroplast genomes in Malus species would be very useful for exploring the mechanisms of environmental adaptation and of molecular variation under human cultivation pressure. Some issues still need to be studied, such as the codon usage diversity of each typical gene in the chloroplast genomes of Malus species and the evolutionary pressures that affect the components of these genomes. In the present study, building on previous work, all 55 Malus chloroplast genomes in the NCBI database, covering a total of 20 Malus species, were considered and analyzed. We performed a comparative analysis of the codon usage patterns of the 20 Malus species and of their evolution. We calculated the ENC, Fop and CBI values, as well as the ENC-GC3s relationship, the G3%–A3% values and the RSCU values, and explored the correlations among them. Furthermore, the genetic relationships among the 20 Malus chloroplast genomes were examined via correspondence analysis, and their characteristics were analyzed.

Materials & Methods

All results returned by searching the NCBI database with the keywords ‘Malus Chloroplast complete genome’ were considered, giving a total of 55 complete chloroplast genomes (Supplementary Tab. S1) that cover 20 Malus sub-species. One genome was selected for each of the 20 sub-species; their names and accession numbers are Malus angustifolia (MN061984.1), Malus baccata (KX499859.1), Malus coronaria (MN068247.1), Malus domestica (MK434916.1), Malus doumeri (KX499861.1), Malus florentina (KX499862.1), Malus halliana (MT246302.1), Malus hupehensis (NC_040170.1), Malus ioensis (MN062004.1), Malus micromalus (MF062434.1), Malus prattii (NC_043902.1), Malus prunifolia (NC_031163.1), Malus sieboldii (MT593044.1), Malus sylvestris (MK434921.1), Malus toringoides (MT483999.1), Malus transitoria (MK098838.1), Malus trilobata (NC_035671.1), Malus tschonoskii (KX499864.1), Malus x atrosanguinea (MN061983.1) and Malus yunnanensis (MH394387.1). The genes ycf1, ycf2, ycf3, psaA, psaB, psbA, psbB, psbC, psbD, rpoC1, rpoC2, rps3, rps8, rps14, rps18, cemA and ccsA in each strain were studied and compared specifically. Sequences were chosen according to the following criteria: (1) longer than three hundred bases, (2) starting with ATG, and (3) a base count divisible by three; sequences containing ambiguous bases were excluded.
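The selection criteria above can be expressed as a small filter. The following is only a minimal sketch in Python (the present study used Matlab); the function name `passes_selection_criteria` and the restriction of unambiguous bases to A, C, G and T are illustrative assumptions.

```python
# Assumption: only A, C, G and T count as unambiguous bases.
VALID_BASES = set("ACGT")


def passes_selection_criteria(cds: str, min_length: int = 300) -> bool:
    """Return True if a coding sequence meets the stated selection criteria."""
    seq = cds.upper()
    return (
        len(seq) > min_length        # (1) longer than three hundred bases
        and seq.startswith("ATG")    # (2) starts with ATG
        and len(seq) % 3 == 0        # (3) base count divisible by three
        and set(seq) <= VALID_BASES  # no ambiguous bases such as N, R or Y
    )
```

A sequence of exactly 300 bases is rejected here because the criterion reads "longer than" three hundred bases.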

The contents of the bases T, G, A and C in the chloroplast genomes were counted. Furthermore, the second parity rule (PR2) plot of the third codon position of each gene, with AT bias [A/(A + T)] as the y-axis and GC bias [G/(G + C)] as the x-axis (McLean et al. 1998; Sueoka, 1999), was used to evaluate the codon bias in all coding sequences concerned. Neutrality plot analysis was used to compare the roles of mutation pressure and natural selection pressure (Sueoka, 1988). The effective number of codons (ENC), which denotes the absolute codon usage pattern in coding sequences, was used to quantify the codon usage bias in the 20 complete Malus chloroplast genomes. The ENC of each gene was calculated using the following formula:

$$ENC^{calculated} = 2 + \frac{9}{\overline{f}_{2}} + \frac{1}{\overline{f}_{3}} + \frac{5}{\overline{f}_{4}} + \frac{3}{\overline{f}_{6}} \tag{1}$$

where \(\overline{f}_{k}\) (k = 2, 3, 4, 6) denotes the average homozygosity for the amino acid class whose degree of codon degeneracy is k, that is, the mean value of \(f_{k}\) over the k-fold degenerate amino acids; \(f_{k}\) is calculated by the following equation:

$$f_{k} = \frac{n\sum\limits_{i = 1}^{k} \left( \frac{n_{i}}{n} \right)^{2} - 1}{n - 1} \tag{2}$$

where \(n_{i}\) is the total number of occurrences of the i-th codon for that amino acid, and n is the total number of codons observed for that amino acid. To elucidate the relationship between GC3s and ENC values, the expected ENC values for different GC3s were calculated as \(ENC^{expected} = 2 + s + 29/[s^{2} + (1 - s)^{2}]\), where s represents the given GC3s. The RSCU values of all Malus coding sequences in chloroplast genomes were calculated using the following equation (Xu et al. 2017):

$$RSCU_{ij} = \frac{g_{ij}}{\sum\limits_{j = 1}^{n_{i}} g_{ij}} \cdot n_{i} \tag{3}$$

where \(g_{ij}\) is the observed number of the j-th codon for the i-th amino acid, which has \(n_{i}\) kinds of synonymous codons. Further, correspondence analysis (COA) was used to explore the principal component coefficients among the 20 complete chloroplast genomes of Malus species based on the RSCU values (the three stop codons, AUG and UGG were excluded). The evolutionary distances among the 20 genomes were further studied using the Euclidean distances among them. All parameters in the present study were calculated in Matlab 2010b (Li et al. 2021a).
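As an illustration of the ENC formula and of the expected ENC curve, the calculation can be sketched in Python as follows. This is a sketch under stated assumptions rather than the Matlab code used in the present study: the names `enc` and `expected_enc` are hypothetical, the standard genetic code is assumed, amino acids with zero homozygosity are skipped, a missing three-fold class is replaced by the mean of the two-fold and four-fold classes, and no upper cap is applied to the result.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
# Number of synonymous codons (degeneracy k) of each amino acid.
DEGENERACY = Counter(a for a in CODON_TABLE.values() if a != "*")


def enc(cds: str) -> float:
    """Effective number of codons of one coding sequence."""
    seq = cds.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    per_aa = defaultdict(list)
    for codon, n_i in counts.items():
        aa = CODON_TABLE.get(codon)
        if aa is not None and aa != "*":
            per_aa[aa].append(n_i)
    f_by_class = defaultdict(list)
    for aa, n_is in per_aa.items():
        n, k = sum(n_is), DEGENERACY[aa]
        if k == 1 or n < 2:
            continue  # Met and Trp carry no information; n = 1 is undefined
        f = (n * sum((n_i / n) ** 2 for n_i in n_is) - 1) / (n - 1)
        if f > 0:  # assumption: skip amino acids with zero homozygosity
            f_by_class[k].append(f)
    f_mean = {k: sum(v) / len(v) for k, v in f_by_class.items()}
    # Assumption: estimate a missing 3-fold class from the 2- and 4-fold classes.
    if 3 not in f_mean and 2 in f_mean and 4 in f_mean:
        f_mean[3] = (f_mean[2] + f_mean[4]) / 2
    missing = [k for k in (2, 3, 4, 6) if k not in f_mean]
    if missing:
        raise ValueError(f"degeneracy classes without data: {missing}")
    # No cap is applied to the result.
    return 2 + 9 / f_mean[2] + 1 / f_mean[3] + 5 / f_mean[4] + 3 / f_mean[6]


def expected_enc(s: float) -> float:
    """Expected ENC for a given GC3s value s (between 0 and 1)."""
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)
```

A gene that uses a single codon for every amino acid gives the minimum value of 20 with this sketch, and `expected_enc(0.5)` gives 60.5.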

Results

Many factors affect plant genomes in the evolutionary process (Saurabh et al. 2019). In the present study, the basic components of typical genes in all chloroplast genomes were counted (Supplementary Tab. S2). To examine the evolutionary forces, the ENC values of all genes in the 20 chloroplast genomes of Malus species were calculated and plotted against GC3s (Fig. 1A), together with the PR2-bias plot (Fig. 1B) and the overall ENC value of each genome (Fig. 1C). The results showed that the ENC values of most genes in the chloroplast genomes of Malus species are greater than 35. In contrast, the ENC value of the gene for ribosomal protein large subunit 16 (rpl16) is 34.931 in Malus prattii (NC_043902.1), Malus prunifolia (NC_031163.1), Malus micromalus (MF062434.1), Malus halliana (MT246302.1) and Malus hupehensis (NC_040170.1), and 34.581 in Malus trilobata (NC_035671.1), Malus yunnanensis (MH394387.1), Malus transitoria (MK098838.1) and Malus ioensis (MN062004.1).

Fig. 1

Evolutionary forces in 20 chloroplast genomes of Malus species. A The effective number of codons (ENC) vs. GC3s values of all genes in the 20 complete chloroplast genomes of Malus species. The curve denotes the expected ENC values for genes without any evolutionary constraint. B The PR2-bias plot of all genes in the 20 complete chloroplast genomes of Malus species. There are 1154 selected genes in total in the 20 genomes. C Overall ENC value of the chloroplast genome of each Malus species

To explore biased codon choices, the relationships between G and C content and between A and T content at the third codon position of genes in the 20 chloroplast genomes of Malus species are shown in the PR2-bias plot (Fig. 1B). A and C are the preferred bases, but the preference is weak, indicating that the bases at the third position are affected mainly by mutation and that natural selection pressure is not dominant. According to the overall ENC values of the 20 chloroplast genomes (Fig. 1C), codon usage bias is stronger in the M. yunnanensis, M. ioensis, M. doumeri and M. florentina chloroplast genomes.
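The two coordinates of one point in a PR2-bias plot can be computed from the third codon positions of a gene, for example as in the following minimal Python sketch. The function name `pr2_bias` is hypothetical, and all codons are used here; restricting the calculation to a subset of codon families would be a different choice.

```python
from collections import Counter


def pr2_bias(cds: str) -> tuple:
    """Return (GC bias, AT bias) at the third codon position.

    GC bias = G3 / (G3 + C3) is the x-axis and AT bias = A3 / (A3 + T3)
    is the y-axis of the PR2 plot; (0.5, 0.5) means no bias.
    """
    seq = cds.upper().replace("U", "T")
    third = Counter(seq[i + 2] for i in range(0, len(seq) - 2, 3))
    g3, c3, a3, t3 = third["G"], third["C"], third["A"], third["T"]
    if g3 + c3 == 0 or a3 + t3 == 0:
        raise ValueError("bias is undefined without G/C or A/T at position 3")
    return g3 / (g3 + c3), a3 / (a3 + t3)
```

Points with an AT bias above 0.5 indicate that A is used more often than T at the third position, and points with a GC bias below 0.5 indicate that C is used more often than G.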

To evaluate the relationships among the codon usage parameters (such as the contents of G, C, A and T, the ENC values, the codon adaptation index and the Fop values) of the genes in all 20 chloroplast genomes of Malus species, a correlation analysis was performed (Fig. 2). The GC3 content shows a strong positive correlation with gene length and with the ENC values, suggesting that sequence length may be an important factor contributing to codon usage bias in the chloroplast genomes, and that GC3 may be a result of codon selection during the evolution of chloroplast genomes. The GC content in the 20 chloroplast genomes of Malus species is contributed mainly by GC12: the correlation between GC and GC12 is 0.953, whereas the correlation between GC and GC3 is 0.363.

Fig. 2

Correlation analysis among the codon usage parameters

The CBI value of a gene is usually regarded as an effective measure of codon bias (Deb et al. 2018). It measures the extent to which a gene uses a subset of optimal codons. In a gene with extreme bias, the CBI value equals 1 (Maldonado et al. 2018). Unusually, in the complete chloroplast genomes of Malus species the CBI shows only a weak correlation with the ENC values (Fig. 3A). This is mainly because the ENC varies over a much broader range than the CBI; most of the CBI values lie between 0.25 and 0.4. To reveal the explicit relationship between them, the ENC values should be normalized to the range of the CBI, as in our previous studies (Li et al. 2018). When the range of the ENC values was equalized to that of the CBI values, the fitted relationship was ENC (Equalization) = − 0.296 × CBI, showing a strong negative relationship between them (Supplementary Fig. S1). There is also a strong positive correlation between the codon bias index and the frequency of optimal codons, indicating that the codon usage pattern in the complete chloroplast genomes of Malus species may have been shaped by the frequency of optimal codons during evolution (Wei et al. 2014). The GC content of a gene plays an important role in determining the effects of base composition bias (Supriyo et al. 2017). A neutrality plot was also used to compare directional mutation pressure with natural selection in the complete chloroplast genomes of Malus species. The relationship between GC12 and GC3 is shown in Fig. 3B, where each point represents a separate gene. The GC12 of the genes ranges from 30 to 55%, and the GC3 from 20 to 40%. GC3 is lower than GC12 in all genes. Fig. 3C displays the GC, GC12 and GC3 contents in all chloroplast genomes of Malus species. The third bases of all codons are AT-rich. The overall GC content over the whole length of every gene is less than 50%. The number of genes with a GC12 content larger than 50% is 99. Fig. 3D describes the relationships between protein length and GC content (both GC3 and GC12). The GC3 contents are lower than the GC12 contents in all genes. The figure also shows that the lengths of the sequences have no obvious relationship with their GC contents. From this perspective, natural selection has the greater impact on the GC contents in chloroplast genomes of Malus species.

Fig. 3

CBI vs. ENC, neutrality of GC content, gene number vs. GC content, and the GC content vs. protein length analysis of the 20 chloroplast genomes of Malus species. A The effect of the ENC on the CBI of all genes in the 20 chloroplast genomes of Malus species. B The neutrality analysis. C Statistical analysis of the overall GC, GC12 and GC3. D Relationship between GC12 / GC3 ratio and protein lengths. The area graphs in the sub-figures denote number density of corresponding genes
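The quantities plotted in the neutrality analysis can be obtained per gene as sketched below in Python. The helper names `gc_by_position` and `neutrality_slope` are hypothetical, GC3 is computed here over all codons, and the ordinary least-squares slope of GC12 on GC3 is one common way to summarize such a plot; it is assumed here for illustration and is not stated in the text. Under the usual reading, a slope near 1 points to mutation pressure and a slope near 0 to selection.

```python
def gc_by_position(cds: str) -> tuple:
    """Return (GC12, GC3) of one coding sequence as fractions."""
    seq = cds.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")

    def gc_at(pos: int) -> float:
        return sum(codon[pos] in "GC" for codon in codons) / len(codons)

    gc1, gc2, gc3 = gc_at(0), gc_at(1), gc_at(2)
    return (gc1 + gc2) / 2, gc3


def neutrality_slope(points: list) -> float:
    """Least-squares slope of GC12 (y) on GC3 (x); points are (GC12, GC3)."""
    xs = [gc3 for _, gc3 in points]
    ys = [gc12 for gc12, _ in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all GC3 values are identical")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
```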

To reveal the codon usage pattern in terms of RSCU values in the 20 Malus chloroplast genomes, we performed the RSCU analysis and counted the codons of all genomes, including the stop codons (UAA, UAG and UGA) and the non-degenerate codons (AUG and UGG), as shown in Fig. 4. The abundant codons, with RSCU values greater than 1.5, are UUA, GUA, UCU, CCU, ACU, GCU, UAU, CAU, CAA, AAU, GAU, AGA and GGA, and the less abundant codons, with RSCU values less than 0.5, are CUG, CUC, GAC, AGC, UAC, GCG, GGC, CGC, GUC, ACG, CAC, CGG, CAG, AAC and UGA. More than half of the genes use UAA as their stop codon. Among all codons, UUA, GCU and AGA have the highest RSCU values. The codon counts of the genes concerned in the 20 Malus chloroplast genomes are also shown in Fig. 4. The most used codons in the chloroplast genomes include AAU (15,824), GAU (16,968), UUA (16,690) and UAU (14,943), whereas the least used include UGC (1445), CGC (2104), CGG (2278) and AGC (2392), apart from the stop codons UGA (192), UAG (283) and UAA (679), which have no corresponding amino acid.

Fig. 4

RSCU values of whole genomes of 20 chloroplast complete genomes of Malus species
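The RSCU of a codon is its observed count divided by the mean count of the synonymous codons of the same amino acid. A minimal Python sketch over a set of coding sequences follows; the function name `rscu` is hypothetical, the standard genetic code is assumed, codons are written with T instead of U, and the three stop codons are treated as one synonymous family.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list: list) -> dict:
    """Return {codon: RSCU} pooled over all coding sequences in cds_list."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the data: RSCU is undefined
        for c in codons:
            # observed count relative to the mean count of the family
            values[c] = counts[c] / total * len(codons)
    return values
```

Within each family the RSCU values sum to the number of synonymous codons, so a value of 1 indicates no preference.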

Genetic differences among the 20 chloroplast genomes of Malus species, obtained by correspondence analysis of the RSCU values of the 20 complete chloroplast genomes, are shown in Fig. 5. The relative and cumulative inertia of the correspondence analysis factors are shown in the inner graph of Fig. 5. The first four axes account for 36.01%, 31.57%, 15.21% and 5.86% of the inertia, so the first two axes explain the evolutionary distances among the Malus species well. The stop codons and the codons for Met and Trp were excluded from the correspondence analysis.

Fig. 5

Correspondence analysis for 20 chloroplast genomes of Malus species (Axis 1 and Axis 2). The inner graph shows the relative and cumulative inertia versus factors
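Correspondence analysis of a genome-by-codon table can be sketched with a singular value decomposition, as below. This is a textbook formulation written in Python with NumPy, not the Matlab routine used in the present study; the function name `correspondence_analysis` is hypothetical, and every row and column of the table is assumed to have a positive sum.

```python
import numpy as np


def correspondence_analysis(table):
    """Return (row coordinates, principal inertias) of a nonnegative table.

    Rows would be genomes and columns the 59 codons kept for the analysis.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                    # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2                  # principal inertia of each axis
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    return row_coords, inertia
```

The relative inertia of each axis is `inertia / inertia.sum()`, and its cumulative sum gives the cumulative inertia; the first two columns of `row_coords` are the coordinates on Axis 1 and Axis 2.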

The Euclidean distances between the RSCU values of the 20 chloroplast genomes of Malus species were used to explore their clustering characteristics (Fig. 6). There are four clusters at a Euclidean distance of 0.06 when all 20 Malus chloroplast genomes are considered. For consistency of the data, the RSCU values of every genome were computed from the same set of gene sequences.

Fig. 6

Clustering analysis for 20 Malus chloroplast genomes
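Grouping genomes at a given Euclidean distance can be sketched as follows in Python. The linkage rule is not stated in the text, so single linkage is assumed here purely for illustration: two genomes fall into the same cluster whenever a chain of pairwise distances no larger than the cutoff connects them. The function names are hypothetical.

```python
import math


def euclidean(u, v) -> float:
    """Euclidean distance between two equally long RSCU vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def clusters_at_cutoff(vectors: dict, cutoff: float) -> list:
    """Single-linkage clusters of {name: RSCU vector} at a distance cutoff."""
    names = list(vectors)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if euclidean(vectors[a], vectors[b]) <= cutoff:
                parent[find(a)] = find(b)   # merge the two clusters

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

A different linkage rule, such as average or complete linkage, could yield a different number of clusters at the same cutoff.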

The codon usage parameters of all 20 coding genes concerned in the chloroplast genomes of Malus species were calculated, and the results are shown in Table 1. The hypothetical chloroplast open reading frame 1 (ycf1) gene shows the largest divergence: the standard deviations of its codon usage parameters are all the largest. The genes concerned differ greatly in codon usage preference; for instance, the GC12 of psbB and rpl16 is greater than 50%, while the other GC contents, whether GC12, overall GC or GC3, are less than 50%. Therefore, the hypothetical chloroplast open reading frames, including ycf1, ycf2 and ycf3, are affected by mutation more obviously than by other factors.

Table 1 Codon usage of several typical genes in chloroplast genomes within Malus species

The sum of the standard deviations of all parameters of each gene, including A, G, C, T, G3, C3, A3, T3, GC12, GC3, CBI, ENC and Fop, was used to explore the overall codon usage divergence of the genes (Fig. 7). The codon usage pattern and base composition of rps7 are identical across all genomes. Some other genes, such as psbA, psbD and ycf2, show relatively lower divergence than the rest. The codon usage divergence of a gene is an indicator of its diversity. According to Fig. 7, the codon usage divergence of rps12 is the largest; however, as shown in Supplementary Tab. S1, its coding sequences are of two different lengths, which results in a larger deviation of the ENC. If the two kinds of sequences were considered separately, the codon usage divergence would be lower than that shown in the present study.

Fig. 7

Codon usage divergence of the specific genes in Malus chloroplast genomes. Values of the bars in the graph represent the divergence degree of the genes, which equals the summation of the standard deviations of both the codon usage patterns and the sequence compositions of the particular genes
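The divergence value of one gene, as described in the caption, is a sum of standard deviations taken across genomes. A minimal Python sketch is given below; the function name `codon_usage_divergence` and the record layout are hypothetical, and the sample standard deviation is assumed because the text does not state which estimator was used.

```python
from statistics import stdev


def codon_usage_divergence(records: list) -> float:
    """Sum of per-parameter standard deviations for one gene.

    records holds one dict per genome, mapping a parameter name
    (for example "GC3", "ENC" or "Fop") to its value for that gene.
    At least two genomes are required.
    """
    parameters = list(records[0])
    return sum(stdev([rec[p] for rec in records]) for p in parameters)
```

A gene whose parameters are identical in every genome gives a divergence of zero.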

Discussion

It has been proposed that the codon usage pattern in coding sequences is very useful for the identification of plants (Hosokawa et al. 2004). DNA sequence data from rpl16 in chloroplasts were used to address phylogenetic relationships among the major lineages of the grass family because of its non-conservative evolution (Zhang et al. 2000). In the present study, the ENC values of all ycf3 genes are equal to 61.122 except in Malus coronaria (MN068247.1), whose ENC value is 60.206. From this perspective, the rpl16 gene in chloroplast genomes of Malus species is greatly influenced by selection pressure. In contrast, the codon usage pattern in ycf3 genes is mainly affected by mutation pressure. The ycf3 protein is very important for the accumulation of the photosystem I complex. Previous studies have shown that the sequence of ycf3 was conserved during its evolution (Naver et al. 2001).

The four canonical bases (A, C, G and T) should be used proportionally if mutation is the sole cause of codon bias in a gene (Li et al. 2014). In contrast, natural selection for codon choice in a gene would cause uneven use of G, C, A and T. In the present study, the chloroplast genomes of Malus prunifolia and M. micromalus show relatively slight codon usage bias. The compositional biases at the third codon position, between G3 and C3 and between A3 and T3, are similar across the 20 Malus chloroplast genomes, and all are slightly affected by both mutation pressure and natural selection. Indeed, mutational pressure has been considered the major factor shaping the codon usage pattern, compared with natural selection (Jenkins et al. 2003).

Where genetic diversity is concerned, it is more important to study the codon usage pattern of specific genes in the chloroplast genomes of Malus species, especially the differences between proteins that derive from variation in nucleotide sequences (Li et al. 2022). The codon usage pattern of certain chloroplast genes can reveal their conservative characteristics (Tan et al. 2020). According to previous findings, ycf1 could be used as a fragment to identify species of land plants, because some regions in the ycf1 gene are the most variable loci (Dong et al. 2015). Other research also revealed that ycf1 is a vulnerable gene (Koh et al. 2006). The chloroplast genome has an independent, highly conserved genetic system (Xiong et al. 2009). According to the CBI values listed in Table 1, the ycf genes (except ycf1) and the rps genes all show high adaptation during evolution. In fact, the ycf1 sequences in the samples are of two types, with lengths of 5640 bp and 1083 bp. In contrast, the genes psbB and psbC show lower codon adaptation. In the present study, the genes rps8, rps18 and rpl16 are affected by natural selection during their evolution; this is consistent with the Fop values, which are larger for the corresponding genes (Supplementary Tab. S2) (Debadin et al. 2019). Related studies also found that the Fop value correlates strongly with gene expression in P. glauca (Torre et al. 2015). The present study also shows that the standard deviations of the codon usage parameters are important for analyzing phenotypic divergence and codon adaptation.

Codons link nucleic acids to proteins and even to the functions of genes (Bastolla et al. 2017). Therefore, codon usage patterns have been used to characterize the evolutionary distance among genomes (Liu et al. 2020a). A similar codon preference denotes a closer genetic relationship (Sophiarani et al. 2019). The chloroplast genomes of Malus species considered in the present study could be divided into three groups. Among all the chloroplast genomes, Malus angustifolia (MN061984.1), Malus coronaria (MN068247.1), Malus ioensis (MN062004.1) and Malus x atrosanguinea (MN061983.1) are from the USA, and three of them clustered together with some other strains; the others are all from China (Fig. 6). From this perspective, geographic location does not obviously affect the evolutionary relationships among these chloroplast genomes.

Phylogenetics depends greatly on the sequence alignment of genomes (Li et al. 2021b). Reliable sequences and effective analytical methods are very helpful for improving the reliability of the results (Zhang et al. 2017). Our analyses produced largely identical deep relationships among the 20 chloroplast genomes of Malus species. The results highlight the importance of the codon usage pattern in studying the diversity of plants and the evolutionary distances among them using their chloroplast genomes. All the diversity degrees were calculated from the ratios of the bases and the ratios of the codons; therefore, the results shown in Fig. 7 depend strictly on normalized data, that is, on the nature of the genetic dispersion among the sequences and not on the lengths of the gene sequences.

All chloroplast genome data of Malus species in the NCBI database were taken into account, and 20 strains covering all of the species were selected. Codon usage patterns in the overall genomes and in several typical genes, as well as the evolutionary relationships among them, were calculated and analyzed in the present study. From the results, we found that (1) all chloroplast genomes of Malus species are AT-rich, and the bases of the sequences are affected by both mutation and natural selection pressure, with natural selection playing a major role, especially at the third codon position; (2) codon usage preferences differ significantly among genes although the overall codon usages of the genomes are similar; and (3) the evolutionary characteristics of the genomes do not show obvious regional characteristics in the correspondence analysis. Overall, the codon usage pattern of the chloroplast genomes of Malus species will facilitate phylogenetic and genetic research on plant species.