Introduction

A growing body of evidence indicates that several features of the gene sequence architecture (such as synonymous codon usage, amino acid composition, polypeptide length, GC content and intron size) correlate with expression levels in prokaryotes and eukaryotes (Akashi 2001; Arunkumar et al. 2013; Chaney and Clark 2015; De La Torre et al. 2015; Duret 2000; Song et al. 2017a, 2017c; Williford and Demuth 2012). If a relationship exists between the abundance of aminoacyl-tRNA and the time required to occupy the acceptor site on a ribosome, then genes with codons that correspond to the most abundant tRNAs may be translated faster than other genes in yeast (Arava 2003). In the Chilodonella uncinata genome, highly expressed genes are biased toward using optimal codons (Maurer-Alcalá and Katz 2016). In both Gryllus bimaculatus and Oncopeltus fasciatus, expression levels were negatively correlated with the usage of amino acids with high size/complexity (S/C) scores (Whittle and Extavour 2015). However, in Parhyale hawaiensis, highly expressed genes had both large and small S/C values (Whittle and Extavour 2015). Short and GC-rich CDSs correlated positively with expression and optimal codon usage bias in four monocots, fifteen dicots and two mosses (Camiolo et al. 2015). In Silene latifolia, gene expression was positively correlated with the GC content at the third codon position (GC3), but highly negatively correlated with the intron GC content (Qiu et al. 2011). In both Caenorhabditis elegans and Homo sapiens, genes with short introns were expressed at higher levels than genes with long introns (Castillo-Davis et al. 2002).

Codon usage bias in genes has been reported to respond to different stressors (Quax et al. 2015). The suppression of circadian regulation at low temperature is caused by oscillator genes associated with the circadian clock, which contain rare codons (Xu et al. 2013). In yeast, stress caused by DNA-damaging compounds or oxidative agents can induce specific tRNA-modifying enzymes, which change codon usage (Begley et al. 2007; Chan et al. 2012). However, genes that respond to environmental adaptation in Ginkgo biloba preferentially had codons that ended with G or C (He et al. 2016). In rice under drought stress, compared with genes with low expression, genes with high expression had higher GC content, a lower ENC (effective number of codons) value, different optimal codons and a bias for codons ending with G or C (Mohasses et al. 2020). Mohasses et al. (2020) proposed that codon optimization can increase gene expression under drought stress in rice.

With the development of genetic engineering, breeders using transgenic methods expect to increase both crop yield and resistance to biotic and abiotic stresses. Although these methods can shorten the time needed to obtain a desired phenotype, codon usage patterns often differ between exogenous genes and the host genome. In this context, the expression of exogenous genes can be affected by codon usage. Several reasons may explain this effect. The use of rare codons, rather than optimal codons, may affect both the speed and accuracy of translation (Akashi 1994; Chaney and Clark 2015; Chu et al. 2014). In addition, the use of rare codons may cause protein misfolding and aggregation, resulting in decreased protein activity (Chaney and Clark 2015; Mitra et al. 2016). Codon optimization can help overcome these problems (Gustafsson et al. 2004; Quax et al. 2015). The first task is to identify the codon usage patterns in exogenous genes and host genomes. In summary, the characterization of codon usage patterns at a genome-wide level can be considered the basis for transgenic research.

Many studies on codon bias and CDS architecture have been conducted using model species such as Arabidopsis, Medicago truncatula, Populus and rice (Ingvarsson 2007; Liu et al. 2015; Morton and Wright 2007; Song et al. 2018b). However, the codon usage patterns under normal growth and stress conditions have not been compared in Arachis duranensis, which is an ancestral species of the cultivated peanut, an oil and protein crop (Bertioli et al. 2011; Kochert et al. 1996; Ramos et al. 2006; Seijo et al. 2007, 2004). The recent availability of the A. duranensis genome, its comprehensive tissue-specific transcriptome characterization and its transcriptome response to nematode and drought stresses has enabled the systematic analysis of codon usage patterns in this species (Bertioli et al. 2016; Clevenger et al. 2016; Dash et al. 2016). Using available genome sequences and RNA-seq datasets, this study investigated codon usage patterns and analyzed the relationships between the gene expression level and CDS architecture in A. duranensis. The CDS architecture includes the frequency of the optimal codon (Fop), polypeptide length and GC contents at the first (GC1), second (GC2) and third (GC3) codon positions.

Materials and methods

Sequence retrieval

To evaluate the codon usage patterns in A. duranensis, the CDSs of A. duranensis were obtained from PeanutBase (http://peanutbase.org/download) (Bertioli et al. 2016; Dash et al. 2016). To avoid bias introduced by partial sequences, the sequences were selected based on the following evaluation criteria (Song et al. 2017a, 2016): (1) CDS starting with ATG and ending with TAA, TAG, or TGA and (2) CDS without premature termination or ambiguous codons.
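The two selection criteria can be expressed as a small filter. The following Python sketch is illustrative only and is not the script used in this study; the function name `is_complete_cds` is hypothetical, and the requirement that the sequence length be a multiple of three is an assumption implied by, but not stated in, the criteria.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
VALID_BASES = set("ACGT")


def is_complete_cds(seq: str) -> bool:
    """Return True if a CDS passes both completeness criteria."""
    seq = seq.upper()
    # Assumption: a complete CDS is a whole number of codons
    # (at least a start codon and a stop codon).
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    # Criterion 2 (ambiguity): only unambiguous bases are allowed.
    if not set(seq) <= VALID_BASES:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Criterion 1: starts with ATG and ends with TAA, TAG or TGA.
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Criterion 2 (premature termination): no in-frame stop before the last codon.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])
```

Applied to a FASTA file of CDSs, such a filter would keep only the sequences for which the function returns True.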

To assess the relationship between CDS architecture and gene expression level, the RNA-seq datasets were downloaded from PeanutBase (https://peanutbase.org/external). The RNA-seq datasets were obtained for normal growth conditions as well as in response to nematode and drought stresses. The RNA-seq datasets of A. hypogaea cv. Tifrunner have been published previously (Clevenger et al. 2016). The relevant RNA-seq data of root tissue were downloaded from PeanutBase. RNA-seq assembly used A. duranensis (AA genome) and A. ipaensis (BB genome) as reference genomes (Clevenger et al. 2016). The raw reads were obtained with an Illumina HiSeq 2500. The fragments per kilobase per million reads mapped (FPKM) were calculated using RSEM (Li and Dewey 2011).

Although a number of changes occurred in orthologous structures between the cultivated peanut and the two wild forms of the peanut, high similarities were detected. The modal divergence between A. duranensis and subgenome A of A. hypogaea cv. Tifrunner was about 2.5 differences per 1,000 bp (Bertioli et al. 2020). The divergence between A. ipaensis and subgenome B of A. hypogaea cv. Tifrunner was 2 differences per 10,000 bp (Bertioli et al. 2020). In addition, more than 98% DNA identity was identified between corresponding genes of A. duranensis and A. ipaensis (Bertioli et al. 2019). In summary, these results indicated that, because of similar gene sequences, gene expression levels are similar between A. duranensis (AA genome) and subgenome A, and between A. ipaensis (BB genome) and subgenome B, under normal growth conditions. The homoeolog expression result showed that A. hypogaea cv. Tifrunner had homoeolog expression balance in most tissues except for reproductive tissues, which showed a slightly stronger bias for subgenome B than for subgenome A (Bertioli et al. 2019). This result also indicated that homoeologs with similar gene structure had similar expression under normal growth conditions. In this study, the expression level of subgenome A from A. hypogaea cv. Tifrunner was used to represent the corresponding gene expression in A. duranensis under normal growth conditions. The gene expression level was transformed to Log2 (FPKM) as the standard expression level. This transformation can reduce expression-level differences between subgenome A and genome A (A. duranensis). The Log2 (FPKM) values were assumed to represent the gene expression levels in root tissues under normal growth conditions in A. duranensis.
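As an illustration of the Log2 (FPKM) scale, the short Python sketch below applies the transformation to a table of FPKM values. The function name `log2_fpkm` and the gene identifiers in the usage note are hypothetical, and skipping genes with an FPKM of zero is an assumption, since the handling of unexpressed genes is not described here.

```python
import math


def log2_fpkm(fpkm_by_gene):
    """Map each gene to log2(FPKM).

    Assumption: genes with non-positive FPKM are skipped, because
    log2 is undefined for them.
    """
    return {gene: math.log2(value)
            for gene, value in fpkm_by_gene.items()
            if value > 0}
```

For example, `log2_fpkm({"geneA": 8.0, "geneB": 0.5})` maps an FPKM of 8 to 3 and an FPKM of 0.5 to -1, so genes expressed below 1 FPKM take negative values on this scale.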

The differentially expressed genes (DEGs) in A. duranensis root tissue under drought and nematode stresses have been published previously (https://peanutbase.org/gene_expression/atlas_drought and https://peanutbase.org/gene_expression/atlas_nematode, respectively) (Brasileiro et al. 2015; Dash et al. 2016; Guimarães et al. 2015). All RNA-seq datasets were generated using a HiSeq 2000 (Brasileiro et al. 2015; Guimarães et al. 2015). Under drought treatment, differential gene expression was obtained by a stress versus control comparison (Guimarães et al. 2015); under nematode treatment, it was estimated by a stress (3, 6 and 9 days after treatment) versus control comparison (Brasileiro et al. 2015). These two papers were published in 2015, but the A. duranensis genome sequence was not available before 2016, so the RNA-seq datasets were originally assembled using a de novo method. After 2016, the authors re-assembled the RNA-seq data using the A. duranensis genome as reference. The differential gene expression values (Log2 (FoldChange)) were uploaded to PeanutBase. A gene was classified as differentially expressed if its Log2 (FoldChange) was above 2 or below -2 and its adjusted p-value (FDR) was below 0.05, using the edgeR package (Anders and Huber 2010).
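The classification rule in the last sentence can be written as a simple threshold test. The Python sketch below only illustrates that rule on values already computed elsewhere (it does not reproduce edgeR); the function names and the input layout, a dictionary of gene to (Log2 (FoldChange), FDR), are assumptions made for the example.

```python
def is_deg(log2_fold_change, fdr, lfc_cutoff=2.0, fdr_cutoff=0.05):
    """True if |log2 fold change| exceeds the cutoff and FDR is below the cutoff."""
    return abs(log2_fold_change) > lfc_cutoff and fdr < fdr_cutoff


def split_degs(results):
    """Split genes into DEGs and NDEGs.

    results: dict mapping gene ID -> (log2_fold_change, fdr)
    (hypothetical layout chosen for this sketch).
    Returns two sorted lists: (degs, ndegs).
    """
    degs = {gene for gene, (lfc, fdr) in results.items() if is_deg(lfc, fdr)}
    ndegs = set(results) - degs
    return sorted(degs), sorted(ndegs)
```

Every gene that fails either threshold falls into the non-differentially expressed group, which is how the DEG and NDEG counts add up to the total number of CDSs analyzed.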

Calculation of the codon index

To compare CDS architectures under normal growth conditions as well as in response to nematode and drought stresses, the frequency of the optimal codon (Fop), polypeptide length, relative synonymous codon usage (RSCU) and GC contents at the first (GC1), second (GC2) and third (GC3) codon positions were estimated as variables. Codon W (version 1.4, http://codonw.sourceforge.net) was used to calculate Fop, polypeptide length and RSCU. The GC contents at the GC1, GC2 and GC3 codon positions were estimated with an in-house Perl script. The optimal codon for an amino acid was defined as the codon with the highest number of tRNA genes for its anticodon among its synonymous codons, and Fop is the frequency of such codons in a gene (Lavner and Kotlar 2005). The RSCU value for a codon was defined as the observed frequency of the codon, divided by the expected frequency under the assumption of equal usage of the synonymous codons for an amino acid (Sharp and Li 1986; Sharp et al. 1986). The formulas for calculating Fop and RSCU are as follows (Lavner and Kotlar 2005; Sharp and Li 1986):

$$\mathrm{Fop}(g)=\frac{1}{N}\sum_{i}\mathrm{syn}\left(i\right){n}_{i}(g)$$
(1)

where ni(g) represents the count of codon i in gene g, N represents the total number of codons in g and syn(i) represents the degeneracy of the amino acid encoded by i.

$${\mathrm{RSCU}}_{ij}=\frac{{X}_{ij}}{\frac{1}{{n}_{i}}\sum_{j=1}^{{n}_{i}}{X}_{ij}}$$
(2)

where Xij represents the number of occurrences of the jth codon for the ith amino acid and ni represents the number of alternative codons for the ith amino acid.
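The RSCU definition (observed codon count divided by the count expected under equal usage of synonymous codons) and the position-specific GC contents can be sketched in Python as follows. This is a minimal illustration, not the Codon W implementation or the in-house Perl script; the function names are hypothetical, the standard genetic code is assumed, and computing GC1, GC2 and GC3 over every codon of the sequence as supplied (including the stop codon, if present) is also an assumption.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (assumption), in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds_list):
    """Return codon -> RSCU for all sense codons of amino acids that occur."""
    counts = Counter()
    for seq in cds_list:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            # Skip stop codons and codons with ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    # Group synonymous codons by amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU undefined
        expected = total / len(codons)  # (1 / n_i) * sum_j X_ij
        for c in codons:
            values[c] = counts[c] / expected  # X_ij / expected
    return values


def gc_by_position(seq):
    """Return (GC1, GC2, GC3) as fractions for one coding sequence."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("sequence shorter than one codon")
    fractions = []
    for position in range(3):
        bases = seq[position:usable:3]
        fractions.append(sum(base in "GC" for base in bases) / len(bases))
    return tuple(fractions)
```

With this definition, the RSCU values of the codons for one amino acid sum to the number of synonymous codons, so a value above 1 marks a codon used more often than expected under equal usage.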

Correlation analysis

Spearman correlation was used to assess the correlation between pairs of variables. In this study, these variables included gene expression level, differential gene expression, Fop, polypeptide length, GC1, GC2 and GC3. Spearman correlation analysis was used to assess how the CDS architecture relates to gene expression level and differential gene expression. One-way ANOVA was used to analyze significance in the correlation analyses. For all statistical tests, a p-value below 0.05 was considered to indicate a significant difference. The JMP program was used to execute the Spearman correlation analysis and ANOVA tests. Previous studies showed that the correlation coefficient between gene expression levels and CDS architecture is low (Ingvarsson 2007; Song et al. 2018c; Whittle and Extavour 2015). Based on these studies, it was assumed in this study that no correlation exists if the correlation coefficient was less than 0.1, even when the correlation was significant (Ingvarsson 2007; Song et al. 2018c; Whittle and Extavour 2015).
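For readers who wish to reproduce the decision rule outside JMP, the following Python sketch computes the Spearman coefficient as the Pearson correlation of average ranks and then applies the two thresholds. It is not the JMP procedure; the function names are hypothetical, the p-value is taken as an input rather than computed, and applying the 0.1 cutoff to the absolute value of the coefficient is an assumption.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def interpret(rho, p_value, min_abs_rho=0.1, alpha=0.05):
    """Apply the rule: significant and |rho| >= 0.1, otherwise no correlation."""
    if p_value >= alpha or abs(rho) < min_abs_rho:
        return "no correlation"
    return "positive" if rho > 0 else "negative"
```

Under this rule, a coefficient of 0.05 is reported as no correlation even with a very small p-value, which matters for genome-scale samples where weak associations easily reach significance.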

Results

Similar CDS architecture in differentially expressed genes or non-differentially expressed genes under nematode and drought stresses in A. duranensis

A total of 32,725 A. duranensis CDSs were analyzed under normal growth conditions. A total of 528 and 1,113 CDSs were identified as DEGs under nematode and drought stresses, respectively. The remaining 32,197 and 31,612 CDSs were identified as non-differentially expressed genes (NDEGs) under nematode and drought stresses, respectively. The average GC1 was 49.29%, followed by GC3 at 42.03% and GC2 at 40.07% in A. duranensis CDSs under normal growth conditions. The average GC content was 43.79% in A. duranensis CDSs under normal growth conditions. Therefore, the average AT content (56.21%) exceeded the average GC content in CDSs under normal growth conditions. Similar patterns were found for DEGs and NDEGs under normal growth conditions as well as nematode and drought stress conditions. The average GC contents of DEGs were 43.39% and 44.23% and the average GC contents of NDEGs were 43.80% and 43.78% under nematode and drought stress, respectively (Tables 1 and 2).

Table 1 Comparison of CDS architecture between differentially expressed genes and non-differentially expressed genes in Arachis duranensis under nematode stress
Table 2 Comparison of CDS architecture between differentially expressed genes and non-differentially expressed genes in Arachis duranensis under drought stress

The average Fop value was almost identical between DEGs and NDEGs under nematode and drought stresses. The average polypeptide length differed for DEGs, but was almost identical for NDEGs under nematode and drought stresses. The average Fop and average polypeptide length were 0.38 and 356, respectively, in A. duranensis CDSs under normal growth conditions. In DEGs, the average Fop values were 0.38 and 0.39 and the polypeptide lengths were 356 and 450 under nematode and drought stresses, respectively (Fop: Mann–Whitney test, p > 0.05, polypeptide length: Mann–Whitney test, p < 0.01). In NDEGs, the average Fop values were 0.38 and 0.38 and the polypeptide lengths were 369 and 366 under nematode and drought stresses, respectively (Mann–Whitney test, p < 0.05).

An RSCU value above 1 (or below 1) indicates that the codon usage frequency exceeded (or remained below) the expectation (Sharp and Li 1987). The results of the present study identified a similar RSCU pattern for all genes under normal growth conditions and for both DEGs and NDEGs under nematode and drought stresses. Twenty-six codons with an RSCU value above 1 were found and the RSCU values of the remaining 35 codons were below 1 (Fig. 1). In addition, the 26 codons with high RSCU values could also be distinguished from other codons based on their sequence composition. These codons preferentially ended with A or T. The 35 codons with low RSCU values ended more often with C or G (Fig. 1).

Fig. 1

Codon usage frequency based on relative synonymous codon usage values in Arachis duranensis. A: Codon usage frequency in all CDSs. B: Codon usage frequency between differentially expressed genes (DEG) and non-differentially expressed genes (NDEG) in response to nematode stress. C: Codon usage frequency between differentially expressed genes and non-differentially expressed genes in response to drought stress. The scale represents RSCU

Comparison of CDS architecture and gene expression between DEGs and NDEGs under nematode and drought stresses in A. duranensis

Although the CDS architectures were broadly similar within DEGs and within NDEGs, differences were found between DEGs and NDEGs under nematode and drought stresses. Under nematode stress, GC1 of DEGs was lower than that of NDEGs (Table 1, Mann–Whitney test, p < 0.01). Under drought stress, both polypeptide length and GC3 of DEGs exceeded those of NDEGs (Table 2, Mann–Whitney test, p < 0.01). In addition, under drought stress GC1 was lower in DEGs than in NDEGs (Table 2, Mann–Whitney test, p < 0.05). In summary, GC1 changed consistently between DEGs and NDEGs (it was lower in DEGs under both stresses), whereas the changes in the other CDS architecture features were inconsistent between nematode and drought stresses.

To compare the expression levels between DEGs and NDEGs under nematode and drought stresses, the expression levels of DEGs and NDEGs were estimated in the root tissue under normal growth conditions. Different gene expression patterns were detected between DEGs and NDEGs under nematode and drought stresses. Under nematode stress, the average expression level of NDEGs (0.33) exceeded that of DEGs (-0.89, Fig. 2, Mann–Whitney test, p < 0.01). These results indicated that the genes that respond to nematode infection had low expression under normal growth conditions. Under drought stress, the average expression level of NDEGs (0.26) was lower than that of DEGs (1.57, Fig. 2, Mann–Whitney test, p < 0.01). These results indicated that drought-stress responsive genes were highly expressed under normal growth conditions.
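The DEG versus NDEG comparisons rely on the Mann–Whitney test. The Python sketch below shows one common form of that test (the U statistic with a tie-corrected normal approximation and no continuity correction), which is reasonable for groups of hundreds or thousands of genes. It is an illustration under those assumptions, not the software procedure used to obtain the reported p-values; the function name `mann_whitney_u` is hypothetical.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value from the normal approximation)."""
    n1, n2 = len(sample_a), len(sample_b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = list(sample_a) + list(sample_b)
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0  # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        t = j - i + 1
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank for ties
        i = j + 1
    rank_sum_a = sum(ranks[:n1])
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        raise ValueError("all observations are identical")
    z = (u_a - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return u_a, p_value
```

In this setting, `sample_a` and `sample_b` would hold the Log2 (FPKM) values of DEGs and NDEGs, respectively, and a small p-value indicates that one group tends to have higher expression than the other.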

Fig. 2

Comparison of gene expression levels between differentially expressed genes (DEGs) and non-differentially expressed genes (NDEGs) in Arachis duranensis. A: Comparison of gene expression levels between DEGs and NDEGs in response to nematode stress. B: Comparison of gene expression levels between DEGs and NDEGs in response to drought stress

Different correlations between gene expression levels under normal growth condition and CDS architectures in DEGs in response to nematode and drought stresses in A. duranensis

Correlation analysis was conducted between gene expression level and CDS architectures under normal growth conditions. The gene expression level of root tissue correlated positively with Fop, GC1 and GC3 (Table 3). This indicated that highly expressed genes had higher codon usage bias and higher GC1 and GC3 in root tissue. In addition, the same correlation analysis was also used to assess DEGs and NDEGs under nematode and drought stresses. The gene expression levels of DEGs and NDEGs were estimated in the root tissue under normal growth conditions. The expression levels of NDEGs correlated positively with Fop, GC1 and GC3 under nematode and drought stresses (Table 3). However, the expression levels of DEGs did not correlate with the CDS architecture under nematode stress and correlated positively with Fop, GC1, GC2 and GC3 under drought stress (Table 3). These results indicated that the expression levels of DEGs were affected by different CDS architectures under drought stress. Furthermore, the correlation between differential expression levels and CDS architecture was assessed. No correlation was found between differential expression level and CDS architecture under nematode and drought stresses (Table 3). This indicated that the CDS architecture did not affect the differential expression level in A. duranensis.

Table 3 Correlation between gene expression levels and CDS architecture in Arachis duranensis roots

Discussion

This study investigated the response of the wild peanut A. duranensis to drought and nematode stresses, using RNA-seq data and the sequenced A. duranensis genome as reference sequences. The analysis focused on the influence of the CDS architecture on the gene expression level in response to drought and nematode stresses. Many studies have reported codon usage biases in plants, animals, microorganisms and viruses (Behura and Severson 2012; Camiolo et al. 2015; Hershberg and Petrov 2009; Jia et al. 2015; Jiang et al. 2008; Li et al. 2016a, 2016b; RoyChoudhury and Mukherjee 2010; Song et al. 2017b). Natural selection and mutation pressure are typically considered the major causes of codon usage bias (Hershberg and Petrov 2008; Song et al. 2018b). In general, gene expression levels may increase through natural selection of codons, while codon bias also exists because of non-random mutation pressures (Hershberg and Petrov 2008). The findings of the present study suggest natural selection as a major force for the codon usage bias in A. duranensis because highly expressed genes preferentially used optimal codons (represented by Fop). CDSs with GC-rich content tended to be highly expressed in A. duranensis. Such GC-rich gene sequences might consume more energy during their replication and translation. However, Yang (2009) demonstrated that the time–cost hypothesis (rather than the energy-cost hypothesis) could provide a better interpretation of highly expressed genes. In addition, natural selection decreases the cost of biosynthesis and increases the speed of translation (Brandis and Hughes 2016; Ellegren and Parsch 2007).

This study showed that the GC contents at the three codon positions follow the same trend (GC1 > GC3 > GC2) and that the AT content is higher than the GC content under both normal growth and stress conditions. A previous study in eudicots indicated that the GC1 content was higher than the GC2 content and the GC3 content was similar to or higher than the GC2 content (Li et al. 2016a). In addition, many studies reported that the AT content exceeded the GC content in eudicots, while opposite patterns were found in Poaceae (Glémin et al. 2014; Kawabe and Miyashita 2003; Li et al. 2016a; Singh et al. 2016).

Previous studies have demonstrated different CDS architectures or codon usage biases between DEGs and NDEGs (Quax et al. 2015). For example, in G. biloba, He et al. (2016) reported that genes that responded to environmental adaptation used codons ending with G or C. The present study showed that the GC1 content of NDEGs was higher than that of DEGs in A. duranensis under nematode and drought stresses. In addition, the RSCU values showed that high frequency codons preferentially ended with A or T in A. duranensis CDSs under nematode and drought stresses. Recently, Sidorenko et al. (2017) demonstrated that high GC content in Arabidopsis CDSs positively impacted transgene expression by decreasing the accumulation of small RNA and DNA methylation. In Oryza sativa, the GC3 content was negatively correlated with gene methylation (Elhaik et al. 2014). In the present study, in response to drought stress, the gene expression of DEGs was positively correlated with the GC content. In summary, these results showed that a reasonable increase in GC content, but not in GC1 content, may contribute to transgene expression when A. duranensis drought resistance genes are transferred into other plants.

The gene expression pattern differs between DEGs and NDEGs in A. duranensis. The gene expression of DEGs in response to nematode stress is lower than that of NDEGs under normal growth conditions. However, the gene expression of DEGs in response to drought stress is higher than that of NDEGs under normal growth conditions. In a previous study, we found that the LRR-containing genes had low expression levels since they often acted as receptors of pathogen elicitors, which in turn triggered a cascade of defense responses that culminated in plant resistance (Song et al. 2018a). In this context, it is important to consider the environment from which A. duranensis originates: regions with low rainfall (699 mm/year) and with an average rainfall of approximately 1,050 mm/year (Leal-Bertioli et al. 2012). Therefore, A. duranensis has adapted to an area with erratic rainfall. In addition, the higher number and expression levels of A. duranensis genes in response to drought stress (in comparison to nematode infection) indicate the severe impact of drought on plants. Pathogen attack tends to trigger more specific and time-restricted responses.

The debate about the correlation between polypeptide length and gene expression level is ongoing. In this study, the gene expression levels did not correlate with polypeptide length for nematode- and drought-related DEGs and in all sequences under normal growth conditions. Previous studies identified different correlations between gene expression and sequence length. For example, no correlation was found between protein length and gene expression levels in Arabidopsis (Duret and Mouchiroud 1999). Highly expressed genes tend to encode short proteins in both Populus and rice (Ingvarsson 2007; Yang 2009). However, in both Tribolium castaneum and Picea, long protein-encoding genes were highly expressed (De La Torre et al. 2015; Williford and Demuth 2012).

The results of this study showed that the levels of differential gene expression did not correlate with the CDS architecture. However, the differential gene expression patterns of the non-Toll interleukin receptor, nucleotide-binding site and leucine-rich repeat (nTNL) genes (all of which are involved in the response to pathogens (Dangl and Jones 2001)) correlated with the number of introns and the GC content in Arabidopsis, Medicago, soybean, Populus and rice (Nepal et al. 2017). These results indicated no correlation between any gene’s ability to respond to environmental stimuli and CDS architecture in A. duranensis.

Conclusion

In the present study, the CDS architectures of DEGs and NDEGs were compared and the relationship between CDS architecture and gene expression levels was investigated. The GC1 content differed between DEGs and NDEGs under both drought and nematode stresses. No correlation was found between differential gene expression and CDS architecture under either nematode or drought stress. These results provide a theoretical foundation for transgene analysis. Codon optimization can be ignored when exogenous genes are transferred into the A. duranensis genome.