1 Introduction

The genus Camellia, which is used worldwide as an ornamental plant and for tea, belongs to the family Theaceae (Vijayan et al. 2012; Yang et al. 2013; Huang et al. 2014). Camellia oil is less well known worldwide, despite its use as an edible oil in China and Japan. Camellia is one of the four main oil-bearing trees in the world, alongside palm, olive, and coconut (Robards et al. 2009).

Through years of research and experimentation, the Guangxi Forestry Research Institute (GFRI) discovered the new species C. osmantha (Ma et al. 2012a, b). C. osmantha is easy to plant, grows rapidly, and has strong cold, heat, and drought tolerance (Ma et al. 2013; Liu et al. 2013) as well as high oil yield (Wang et al. 2014). C. osmantha cv ‘yidan’ is recognized as a new variety of C. osmantha (Ma 2020). The plant height and crown width of 6-year-old C. osmantha cv ‘yidan’ were 5.39 m and 7.17 m, respectively, and the oil production of a 5-year-old plant was 0.0590 kg·m–2 (Liang et al. 2017), almost double the standard oil yield for C. oleifera cultivars (0.0325 kg·m–2). Camellia oil is also known as “eastern olive oil” because Camellia and olive oils have similar chemical compositions, with high amounts of oleic acid and linoleic acid and low levels of saturated fats. At present, the total production area of C. osmantha cv ‘yidan’ is over 1500 ha, mainly in Qinzhou, Laibin, Yulin, Yunnan, and Hainan, China.

In China, the planting area of C. oleifera reaches 4,466,700 ha, and the oil production is 600,000 tons. Camellia oil production needs to be further developed. C. osmantha cv ‘yidan’ is a promising new variety that produces 1590 kg of oil per hectare, about double the standard oil productivity of the elite C. oleifera cultivar ‘cenruan 3’ (750 kg·ha–1) (Liang et al. 2017). In plants, chloroplasts play an important role in maintaining life on Earth by providing carbohydrates, amino acids, lipids, and other metabolic substances (Daniell et al. 2021). Plant oil is one of the most important products of photosynthetic carbon assimilation. Fatty acid biosynthesis begins early in the seed-filling stage and continues until seed maturation, and oil then accumulates rapidly in the seed during the late stage of maturation (Cao et al. 2021). Previous studies have shown that plastidial acetyl-CoA carboxylase (ACCase) is a key enzyme regulating the rate of de novo fatty acid biosynthesis, and that expression of the ACCase gene is directly correlated with changes in lipid content (Modiri et al. 2018). In addition, the expression of oil biosynthesis-related transcription factors, such as WRINKLED1, is influenced by photosynthetic activity (Hua et al. 2012). Therefore, research on genes related to oil biosynthesis and photosynthetic characteristics, based on the whole chloroplast genome sequence of C. osmantha cv ‘yidan’, is of great significance for improving production. Moreover, the study of chloroplast genes offers a new approach for improving oil production in other oil plants.

At present, the chloroplast genome sequences of more than 20 plants in the genus Camellia have been published in NCBI, including ornamental species (Huang et al. 2013; Yang et al. 2013), tea-producing species, and C. oleifera. The chloroplast (cp) genome is independent of the nuclear genome and exhibits maternal inheritance and semi-autonomous genetic characteristics (Guo et al. 2018). The cp genome in Camellia species has a typical four-segment, closed-loop structure, with a large single-copy (LSC) region, a small single-copy (SSC) region, and two inverted repeats (IRs) of roughly the same length (Zheng et al. 2019). Among these structural regions, the IRs are the most stable, and the LSC has a higher mutation rate than the SSC. The coding regions of genes evolve more slowly, which makes them suitable for analyzing relationships at the family level and above, while the non-coding regions mutate faster (Chen et al. 2018), which makes them more suitable for analyzing relationships at lower levels such as genera and species (Clegg et al. 1994; Cui et al. 2019; Yang et al. 2019; Zeng et al. 2017). Thus, the maternal inheritance and highly conserved genes of the chloroplast genome provide favorable conditions for studying the phylogeny of plants.

Research on the chloroplast genome of Camellia plants is currently limited to the use of some chloroplast genes for phylogenetic analysis. Here, we describe the whole chloroplast genome sequences of C. osmantha cv ‘yidan’ and three other Camellia species, obtained using the next-generation Illumina genome analyzer platform. The three representative species have notable phenotypic differences (including pericarp thickness, fruit size, seed yield, and oil content) and are widely cultivated in southern China. This study aimed to provide more information for the classification of C. osmantha cv ‘yidan’ by clarifying and comparing the cp genome sequences and structural variations between C. osmantha cv ‘yidan’ and three closely related Camellia species.

2 Materials and methods

Sample preparation, sequencing, and chloroplast genome assembly

Fresh and healthy leaves of four Camellia species (C. osmantha cv ‘yidan’, Camellia vietnamensis cv ‘hongguo’, Camellia oleifera cv ‘cenruan 3’, and Camellia semiserrata cv ‘hongyu 1’) were sampled and used for complete cp genome sequencing. The four Camellia species were deposited in the Camellia oil Germplasm Resource (Latitude 22°55′51″, Longitude 108°20′03″). A modified CTAB method was used to extract total genomic DNA from 50 mg of fresh leaves [58]. A 270- or 350-bp insert library was constructed for each species using TruSeq DNA sample preparation kits (Illumina, San Diego, CA, USA). DNA from the four species was indexed with tags and pooled for Illumina paired-end (2 × 150 bp) sequencing at the Kunming Institute of Botany, Chinese Academy of Sciences.

A total of 72 million raw reads were generated and made available in FASTQ format. The quality of the raw sequence reads was evaluated using the software package FastQC (Andrews 2010). Trimmomatic v0.36 was used to remove adapter, contaminant, low-quality (Phred score < 30), and short (< 36 bp) sequencing reads. The remaining high-quality reads were assembled de novo using the NOVOPlasty pipeline v2.7.2 with default parameters and a k-mer size of 39 or 23, following the developer's suggestions, with the psbA gene of C. oleifera cv ‘cenruan 3’ used as the seed input.
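The read-filtering thresholds above can be illustrated with a minimal Python sketch. It is not Trimmomatic and does not reproduce its trimming steps; as a simplifying assumption, a read is kept only if its mean Phred score is at least 30 and its length is at least 36 bp. The function names and the Phred+33 quality encoding are assumptions made for illustration.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(seq: str, quality: str, min_q: float = 30.0, min_len: int = 36) -> bool:
    """Return True if the read passes the length and mean-quality thresholds."""
    if len(seq) < min_len:
        return False  # too short (also protects mean_phred from empty input)
    return mean_phred(quality) >= min_q


def filter_fastq(lines):
    """Yield (header, seq, quality) for passing reads from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        quality = next(it).strip()
        if keep_read(seq, quality):
            yield header.strip(), seq, quality
```

A real pipeline would also clip adapters and handle paired reads together, which this sketch leaves out.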

Chloroplast genomic annotation and sequence analyses

The assembled genomes of the four species were initially annotated using PGA (Qu et al. 2019). The annotated codon positions and intron/exon boundaries were manually corrected by comparison with known homologous genes in a Camellia cp genome (NC_023084.1). The circular structures were mapped using the OGDRAW tool (Lohse et al. 2013). The exact IR/LSC and IR/SSC boundaries were determined by aligning these regions with homologous sequences from another Camellia cp genome (NC_023084.1).
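Once junction positions are known, the quadripartite layout can be sanity-checked: the two IR copies should be reverse complements of each other. The following Python sketch assumes a genome linearised in the order LSC, IRb, SSC, IRa and takes region lengths as input; the function names and this layout are illustrative assumptions, not the procedure used in this study.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (A, C, G, T only)."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]


def region_summary(genome: str, lsc_len: int, irb_len: int, ssc_len: int) -> dict:
    """Split a genome laid out as LSC-IRb-SSC-IRa and check that the IRs match."""
    irb_start = lsc_len
    ssc_start = irb_start + irb_len
    ira_start = ssc_start + ssc_len
    irb = genome[irb_start:ssc_start]
    ira = genome[ira_start:]
    return {
        "LSC": lsc_len,
        "IRb": len(irb),
        "SSC": ssc_len,
        "IRa": len(ira),
        # True only when IRa is the exact reverse complement of IRb
        "IRs_match": ira == reverse_complement(irb),
    }
```

With real data the comparison would usually tolerate a few mismatches near the junctions; the exact-match test here is the simplest possible version.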

Variation detection and evolutionary relationship analysis

Repeat structures, including palindromic, forward, complement, and reverse repeats, were searched with REPuter on BiBiServ (https://bibiserv.cebitec.uni-bielefeld.de/reputer) with a repeat size of 15 bp and 90% or greater sequence identity. SSRs within the four cp genomes were detected using MISA software (https://webblast.ipk-gatersleben.de/misa/index.php). The following parameters were set in MISA: the minimum number of repeats was 10 for mononucleotides, 6 for dinucleotides, 5 for trinucleotides, and 5 for tetranucleotide, pentanucleotide, and hexanucleotide repeats, and the maximum length of sequence between two SSRs for them to be registered as a compound SSR was 100 bp.
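To make the SSR thresholds concrete, the following Python sketch finds perfect SSRs with a regular expression, using the minimum repeat counts listed above. It is not MISA: compound SSRs (the 100 bp rule) are not handled, and the function name `find_ssrs` is a hypothetical helper introduced only for illustration.

```python
import re

# Minimum repeat counts per motif length, taken from the thresholds in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq: str):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # A motif of motif_len bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are just a shorter unit repeated (e.g. 'AA' is 'A' x 2).
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return sorted(hits)
```

Start positions are 0-based. Because the search is greedy from left to right, the reported motif is the phase in which the repeat is first encountered.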

We aligned the 112 Camellia and four other oil-producing species cp genome sequences using ClustalX. Unambiguously aligned DNA sequences were used for phylogenetic analyses, and ambiguously aligned regions were excluded. Maximum likelihood (ML) analyses were conducted using MEGA7. Bootstrap support (BS) values for individual clades were calculated from 1,000 bootstrap replicates. ML heuristic searches were conducted with nearest-neighbor interchange (NNI). The four Camellia cp genomes, together with 108 available Camellia cp genome sequences (Table 1) and four cp genome sequences from other oil-producing species (GenBank accession nos. JF937588.1 (Ricinus communis cultivar Hale), NC_016736.1 (Ricinus communis), GU931818.1 (Olea europaea cultivar Frantoio), and NC_013707.2 (Olea europaea cultivar Bianchera)), were used to construct an ML tree in MEGA7 with default parameters (Tamura et al. 2011).
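The resampling idea behind bootstrap support can be sketched in Python as follows. This is not what MEGA7 does internally; it only shows how one bootstrap replicate is drawn by sampling alignment columns with replacement. As a crude stand-in for excluding ambiguously aligned regions, columns containing gaps or ambiguity codes are dropped first. Both function names are hypothetical.

```python
import random


def drop_ambiguous_columns(alignment: dict, allowed: str = "ACGT") -> dict:
    """Keep only columns in which every sequence has an unambiguous base."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [i for i in range(length)
            if all(alignment[n][i].upper() in allowed for n in names)]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def bootstrap_replicate(alignment: dict, rng: random.Random) -> dict:
    """Draw one bootstrap replicate by resampling columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    cols = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][i] for i in cols) for n in names}
```

In a full analysis, a tree would be inferred from each of the 1,000 replicates, and the support for a clade would be the fraction of replicate trees that contain it.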

Table 1 List of accession numbers of the chloroplast genome sequences used in this study

3 Results

The structure of the chloroplast genomes of four Camellia species

The complete cp genomes of C. semiserrata cv ‘hongyu 1’ (GenBank accession no. OP953553), C. vietnamensis cv ‘hongguo’ (GenBank accession no. OP953555), C. osmantha cv ‘yidan’ (GenBank accession no. OP936137), and C. oleifera cv ‘cenruan 3’ (GenBank accession no. OP953554) were sequenced using Illumina sequencing technology (Fig. 1). The cp genome of each of the four species is a circular DNA molecule ranging in size from 156,807 to 157,005 bp, with the typical quadripartite structure consisting of two inverted repeats (IRa and IRb) and LSC and SSC regions (Table 2).

Fig. 1 Gene maps of the C. osmantha, C. semiserrata, C. vietnamensis, and C. oleifera cp genomes. Genes drawn outside the circle are transcribed clockwise and those inside are transcribed counterclockwise. Genes belonging to different functional groups are color-coded. The inner dark gray area represents the GC content of the chloroplast genome, and the light gray area indicates the AT content (Lohse et al. 2013).

Table 2 Summary of Camellia chloroplast genome features

The C. semiserrata cv ‘hongyu 1’, C. osmantha cv ‘yidan’, and C. oleifera cv ‘cenruan 3’ cp genomes each contain 134 genes (81 protein-coding genes, 39 transfer RNA (tRNA) genes, and 8 ribosomal RNA (rRNA) genes, as well as 6 genes with unknown functions). The C. vietnamensis cv ‘hongguo’ cp genome contains 136 genes (83 protein-coding genes, 39 tRNA genes, and 8 rRNA genes, as well as 6 genes with unknown functions), including two copies of the rpl2 gene. By contrast, rpl2 is not found in the other three species.

Among the 134 unique genes in C. semiserrata cv ‘hongyu 1’, C. osmantha cv ‘yidan’, and C. oleifera cv ‘cenruan 3’, 15 contain one intron (petB, petD, atpF, ndhA, ndhB, rps12, rps16, rpl16, trnG-UCC, trnK-UUU, trnL-UAA, trnA-UGC, trnI-GAU, trnV-UAC, and rpoC1), and 2 contain two introns (clpP and ycf3) (Table 3). Previous studies reported that ycf3 is necessary for the stable accumulation of the photosystem I complex (Boudreau et al. 1997; Naver et al. 2001; Guo et al. 2018). Among the 135 unique genes in C. vietnamensis cv ‘hongguo’, 16 contain one intron (petB, petD, atpF, ndhA, ndhB, rps12, rps16, rpl2, rpl16, trnG-UCC, trnK-UUU, trnL-UAA, trnV-UAC, trnA-UGC, trnI-GAU, and rpoC1), and 2 contain two introns (clpP and ycf3). The gene maps of C. osmantha cv ‘yidan’, C. semiserrata cv ‘hongyu 1’, C. oleifera cv ‘cenruan 3’, and C. vietnamensis cv ‘hongguo’ are shown in Fig. 1.

Table 3 List of genes in the three Camellia chloroplast genomes

Expansion and contraction of the border regions

The border regions and neighboring genes of the four Camellia cp genomes were compared to analyze the expansion and contraction of the junction regions (Fig. 2). The cp genomic structures, including gene type, gene order, and gene number, were conserved in C. osmantha cv ‘yidan’ and C. oleifera cv ‘cenruan 3’, while the cp genome of C. vietnamensis cv ‘hongguo’ exhibited visible differences at the IRb/SSC/IRa borders. The IRb region extended into the ycf1 gene by 1042–1068 bp (1068 bp in C. osmantha cv ‘yidan’ and C. oleifera cv ‘cenruan 3’, and 1042 bp in C. semiserrata cv ‘hongyu 1’).

Fig. 2 Comparison of the SSC, IR, and LSC border regions among the four Camellia cp genomes. SSC, small single-copy region; IRs, inverted repeat regions; LSC, large single-copy region

The IRa/SSC borders displayed large differences among the four cp genomes. The gene ndhF is located at the IRa/SSC or IRb/SSC junction, with 5–65 bp gaps between ndhF and the IR/SSC junction (5, 56, and 65 bp gaps in C. semiserrata cv ‘hongyu 1’, C. osmantha cv ‘yidan’, and C. oleifera cv ‘cenruan 3’, respectively). The ndhF and ycf1 genes in C. vietnamensis cv ‘hongguo’ are reversed in the IRb/SSC/IRa boundary region compared with the cp genome sequences of the other three species. ndhF in the SSC region was 56 bp from the IRb/LSC junction in C. vietnamensis cv ‘hongguo’. By contrast, the IRa/LSC and IRb/LSC boundary regions were relatively conserved in the four cp genomes. The gene rpl2 formed another boundary by expanding into the IRa region in C. vietnamensis cv ‘hongguo’, leading to complete duplication of the gene within the IRs (Table 3).

Long-repeat and simple sequence repeat (SSR) analysis

We detected palindromic, forward, complementary, and reverse repeats in the four cp genomes. Overall, 50 repeat sequences were identified in each of the Camellia cp genomes, comprising 23–24 palindromic repeats, 16–17 forward repeats, 7–9 reverse repeats, and 2–4 complementary repeats (Figure S1(A)). The palindromic repeats ranged in length from 19 to 79 bp, the forward repeats from 19 to 42 bp, the reverse repeats from 19 to 23 bp, and the complementary repeats from 19 to 20 bp (Figure S1(B–E)).
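The four repeat classes differ only in how the second copy relates to the first. The following Python sketch classifies an exactly matching pair; it is an illustration of the definitions, not the search tool used here, and it ignores the 90% identity tolerance. The function name `classify_repeat` is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first: str, second: str) -> str:
    """Classify the relation of `second` to `first` for an exact repeat pair."""
    first, second = first.upper(), second.upper()
    complement = first.translate(_COMPLEMENT)
    if second == first:
        return "forward"        # same sequence, same orientation
    if second == first[::-1]:
        return "reverse"        # same strand, read backwards
    if second == complement:
        return "complement"     # complementary bases, same direction
    if second == complement[::-1]:
        return "palindromic"    # reverse complement
    return "none"
```

For example, relative to AACG the copy GCAA is a reverse repeat, TTGC a complement repeat, and CGTT a palindromic repeat. The order of the checks decides the label for sequences that satisfy more than one relation.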

We found 50, 51, 51, and 53 SSRs in the C. semiserrata cv ‘hongyu 1’, C. osmantha cv ‘yidan’, C. vietnamensis cv ‘hongguo’, and C. oleifera cv ‘cenruan 3’ cp genomes, respectively (Fig. 3). These SSRs were mainly composed of adenine (A) or thymine (T) repeats and did not contain guanine (G) or cytosine (C) repeats. Moreover, the four cp genomes only contained mononucleotide repeats ranging from 10 to 17 bp.

Fig. 3 Number of SSR motifs in different Camellia cp genomes

Phylogenetic analysis

We generated a phylogenetic tree from the nucleotide sequences of the cp genomes of 112 Camellia species and other oilseed crops using the maximum likelihood method (Fig. 4), with Coffea arabica (NC_008535.1) selected as the outgroup. C. osmantha cv ‘yidan’ is most closely related to C. vietnamensis cv ‘hongguo’ and C. oleifera cv ‘cenruan 3’, which belong to the section Oleifera Chang.

Fig. 4 Phylogenetic tree of Camellia and other related oilseed species constructed using the maximum likelihood method

4 Discussion

In this study, we sequenced the complete cp genomes of four Camellia species and annotated their sequences. Phylogenetic studies have shown that cp genome evolution includes nucleotide substitutions and structural changes (Feng et al. 2008; Haberle et al. 2008; Guo et al. 2018).

Some studies have shown that intron or gene losses occur in the chloroplast genome (Downie et al. 1996; Downie et al. 1991; Graveley et al. 2001; Guisinger et al. 2010; Jansen et al. 2007; Ueda et al. 2007). Introns play an important role in the regulation of gene expression (Xu et al. 2017). They can increase gene expression levels in specific locations and at specific times (Niu et al. 2011; Le et al. 2003). The intron regulation mechanism has also been studied in other species (Callis et al. 1987; Emami et al. 2013). However, no studies have analyzed the association between intron loss and gene expression. The chlB, chlL, chlN, and trnP-GGG genes were missing in the four Camellia cp genomes but were found in several other angiosperm plastomes (Jansen et al. 2007; Green 2011; Mader et al. 2018). These four genes represent synapomorphies for flowering plants (Jansen et al. 2007). We found 15 genes that contained one intron and two genes that contained two introns (ycf3 and clpP) in the C. osmantha cv ‘yidan’ cp genome. The ycf3 protein is necessary to stabilize the complex of photosystem I with the light-harvesting complex I (Boudreau et al. 1997; Naver et al. 2001). We therefore speculate that intron gain in ycf3 may alter the expression of genes encoding the photosystem I assembly protein. In future work, we will focus on the photosynthesis-related genes in the four species. The clpP gene also contains two introns, and intron gain in clpP may alter the regulation of genes encoding the Clp protease proteolytic subunit. This phenomenon might be due to increased evolutionary rates.

In addition, key genes related to lipid synthesis and photosynthesis are present in the chloroplast genome or encode proteins located in the chloroplast, such as the carboxylase gene accD (Modiri et al. 2018), ω3-fatty acid desaturase (FAD) genes (Raboanatahiry et al. 2021), fatty acid exporter genes (FAX1-1, FAX2, FAX4) (Xiao et al. 2021; Li et al. 2020), and phosphoenolpyruvate/phosphate translocator (PPT) genes (Tang et al. 2022). The accD gene encodes a subunit of heteromeric acetyl-coenzyme A carboxylase (ACCase), a key enzyme involved in plant fatty acid biosynthesis (Nakkaew et al. 2008; Wicke et al. 2011; Kode et al. 2005; Zhang et al. 2016). Maliga and Svab (2011) showed that accD in Nicotiana sylvestris was 1539 bp long. The accD sequence lengths were 1541, 1541, 1541, and 1532 bp in C. oleifera cv ‘cenruan 3’, C. semiserrata cv ‘hongyu 1’, C. osmantha cv ‘yidan’, and C. vietnamensis cv ‘hongguo’, respectively, suggesting that this gene has been conserved in plant cp genomes. Moreover, we observed no pseudogene formation of accD in the four Camellia cp genomes, consistent with the importance of fatty acid biosynthesis for these oil-producing plants. The Camelina sativa ω3-fatty acid desaturases CsaFAD7 and CsaFAD8 are located in the chloroplast and can modify the fatty acid composition of seed oil, which is useful for genetic engineering strategies (Raboanatahiry et al. 2021). FAX1-1, FAX2, and FAX4 are all localized to the chloroplast membrane and play critical roles in transporting plastid fatty acids for triacylglycerol (TAG) biosynthesis during seed embryo development (Li et al. 2020). BnaFAX1-1 may simultaneously improve seed oil content, oil quality, and biological yield in B. napus (Xiao et al. 2021). BnaPPT1 plays an important role in leaf membrane lipid synthesis and chloroplast development, thus affecting photosynthesis (Tang et al. 2022). Therefore, the study of lipid metabolism-related genes in the chloroplast genome provides a new approach for future molecular breeding of Camellia for oil.

Previous studies showed that C. oleifera cv ‘cenruan 3’ is more adapted to low light conditions than the other Camellia species (Ma et al. 2012a, b). The light saturation point of C. osmantha cv ‘yidan’ is 499.7 μmol·m−2·s−1, and this variety is more adapted to high light conditions, so its light energy utilization may be higher. Differences in plant photosystems may be used to improve the efficiency of light absorption and transformation and further increase plant yield (Zhang et al. 2011). As the center of photosynthesis, the chloroplast and its genome are of great significance for revealing the mechanism and metabolic regulation of plant photosynthesis (Fang et al. 2010; Huang et al. 2013). Seed or silique wall photosynthesis contributes to increased seed weight and oil content (Hu et al. 2018; Liu et al. 2012). The rpoA and rpoC2 genes encode the α and β″ subunits, respectively, of the plastid-encoded RNA polymerase (PEP), which is responsible for the transcription of most photosynthetic genes. We speculate that the rpoA and rpoC2 genes in the chloroplast genome play a key role in the photosynthesis of C. osmantha.

In addition, when chloroplast gene fragments are used to delineate low-rank taxonomic units, suitable highly variable regions should first be screened across the whole chloroplast genome (Dong et al. 2012). Chloroplast molecular markers from hypervariable regions can explain intraspecific divergence (Lin et al. 2022; Li et al. 2022; Xiong et al. 2022). Moreover, chloroplast genomes can be used to develop high-resolution molecular markers for tracking population genetic diversity (Song et al. 2020). In C. vietnamensis cv ‘hongguo’, rpl2 is present, whereas it has not been found in the other three species. The gene encodes ribosomal protein L2; its full-length sequence is 1494 bp, including a 671 bp intron. Molecular markers based on the rpl2 gene could therefore be used to distinguish these four species. However, rpl2 is also found in other plants of the genus Camellia, so whether such markers can differentiate the four species from other Camellia spp. requires further research.

Phylogenetic analysis of the four Camellia species revealed that C. osmantha cv ‘yidan’ is more closely related to C. vietnamensis cv ‘hongguo’ and C. oleifera cv ‘cenruan 3’ than to C. semiserrata cv ‘hongyu 1’, other Camellia species, and other oil crops. This study provides an assembly of the whole chloroplast genome of C. osmantha cv ‘yidan’, which may be useful for future breeding and further biological discoveries. It provides a theoretical basis for improving Camellia oil yield and determining the phylogenetic status of this variety.