Introduction

The Asteraceae is one of the largest families of flowering plants, comprising more than 23,000 species in 1620 genera (Bremer 1994). Its species are distributed globally, from the polar regions to the tropics, and have adapted to a wide variety of natural environments and habitats (Funk et al. 2005). Although a considerable number are shrubs, trees, and vines, most members of the family are herbaceous. Asteraceae species display high variation in morphological, physiological, and biochemical traits, such as secondary chemistry (Carlquist 1976) and chromosome number (Barker et al. 2008). This phenotypic and species diversity makes the family an ideal resource for studying species diversification and phylogenetic evolution. The family is also of notable economic and ecological significance: it includes important oil crops, medicinal herbs, ornamentals, and horticultural species, as well as some harmful invasive weeds found worldwide (Lundberg and Bremer 2003).

Chloroplasts (cp) are organelles in the cells of eukaryotic autotrophs. Their main role is to carry out photosynthesis, which supplies the energy on which most living organisms ultimately depend. They possess their own genome, together with the machinery to replicate and transcribe it, and are usually maternally inherited. In land plants, the cp genome is highly conserved (Chumley et al. 2006), typically with a circular organization in which a pair of inverted repeat regions (IRs) is separated by a large single-copy (LSC) region and a small single-copy (SSC) region (Palmer 1991). This highly conserved structure makes the cp genome suitable for comparative analysis of both distantly and closely related species (Raubeson and Jansen 2005). Comparative analysis of cp genomes has been shown not only to provide valuable information on variation in cp genome structure and organization but also to help reveal the molecular evolution and diversification of plants (Liu et al. 2013; Rivas et al. 2002).

With the emergence and development of high-throughput sequencing technology, an increasing number of Asteraceae chloroplast genomes have been sequenced in recent years (Nie et al. 2012; Liu et al. 2013; Bock et al. 2014). The available complete genomes provide an opportunity for comparative analysis within the family at the whole chloroplast genome level. Although comparative chloroplast genome analyses have been reported for many families (Wu et al. 2011; Ghimiray and Sharma 2014), no systematic analysis of the structure and organization of chloroplast genomes in the Asteraceae has been performed to date. Here, eight chloroplast genomes from different tribes of the Asteraceae were selected as representatives for comparative analysis: Ageratina adenophora (NC_015621) of the tribe Eupatorieae (Nie et al. 2012); Artemisia frigida (NC_020607) (Liu et al. 2013), Chrysanthemum indicum (NC_020320), and Chrysanthemum × morifolium (NC_020092), all of the tribe Anthemideae; Guizotia abyssinica (NC_010601) (Dempewolf et al. 2010) of the tribe Millerieae; Helianthus annuus (NC_007977) (Timme et al. 2007) of the tribe Heliantheae; Jacobaea vulgaris (NC_015543) (Doorduin et al. 2011) of the Senecioneae clade; and Lactuca sativa (DQ383816) (Timme et al. 2007) of the subtribe Lactucinae (tribe Cichorieae). Our aim was to compare the gene content, genome structure, and RNA editing of these cp genomes and thereby provide information for a better understanding of cp genome evolution within the family.

Materials and Methods

Comparison of Genome Content and Organization of Asteraceae cp Genomes

Complete cp genome sequences of the eight Asteraceae species were downloaded from the Chloroplast Genome Database (http://chloroplast.ocean.washington.edu/). Gene content and genome organization were obtained from the available annotation files. To identify the positions of the LSC, SSC, and IR regions, the sequences of the other seven genomes were each aligned with that of H. annuus using ClustalW2 (https://www.ebi.ac.uk/Tools/msa/clustalw2/) with default settings. The size, gene content, gene order, and organization of the eight Asteraceae cp genomes were then compared manually.

Structural Comparison of Asteraceae cp Genomes

Gene order conservation among the eight Asteraceae cp genomes was visualized by dot-plot analysis using the program Mulan (http://mulan.dcode.org/) (Ovcharenko et al. 2005). The alignments were performed with the annotated A. adenophora cp genome as the reference. Evolutionarily conserved sequences were detected using a threshold of at least 70 % identity over 100 bp. In addition, structural variation among the Asteraceae cp genomes was compared using the Mauve software (Darling et al. 2004), with the cp genome of Nicotiana tabacum (NC_001879) (Shinozaki et al. 1986) as the reference sequence.
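
The conservation criterion of at least 70 % identity over 100 bp can be illustrated with a minimal Python sketch. This is not Mulan's implementation: it assumes the two sequences have already been aligned to equal length (with '-' for gaps), and the function name conserved_windows is hypothetical.

```python
def conserved_windows(aln_a, aln_b, window=100, min_identity=0.70):
    """Return 0-based start columns of alignment windows that meet the threshold.

    aln_a and aln_b are two rows of a pairwise alignment (equal length,
    '-' for gaps). A column counts as identical only when both rows carry
    the same non-gap character. This is an illustrative sketch, not the
    algorithm used by Mulan.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [int(a == b and a != "-")
               for a, b in zip(aln_a.upper(), aln_b.upper())]
    hits = []
    if len(matches) < window:
        return hits
    running = sum(matches[:window])  # identical columns in the first window
    for start in range(len(matches) - window + 1):
        if start > 0:
            # slide the window one column to the right
            running += matches[start + window - 1] - matches[start - 1]
        if running / window >= min_identity:
            hits.append(start)
    return hits
```

Overlapping windows that pass the threshold could then be merged into contiguous conserved blocks; that step is omitted here for brevity.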

Sequence Variation Analysis and Marker Identification

The program mVISTA (http://genome.lbl.gov/vista/mvista/) (Frazer et al. 2004) was used in Shuffle-LAGAN mode to compare the A. adenophora cp genome with the remaining seven Asteraceae cp genomes, using the sequence annotation of A. adenophora.

Intergenic regions with high sequence diversity were then selected for phylogenetic analyses among the eight Asteraceae species. All sequences were downloaded from GenBank. The ClustalW tool integrated into the BioEdit software (Hall 1999) was used to align the concatenated sequences of each region, and the alignments were subsequently edited manually. PAUP* (Swofford 2002) was used to construct maximum parsimony (MP) trees, with 1000 bootstrap replicates to estimate MP branch support values.

Prediction of the RNA Editing Sites in the Asteraceae cp Genomes

All protein-coding genes of the eight Asteraceae cp genomes were downloaded from the NCBI database (www.ncbi.nlm.nih.gov) according to their annotation. The Predictive RNA Editor for Plants (PREP)-Cp tool (http://prep.unl.edu/cgi-bin/cp-input.pl) (Mower 2009) was used with default parameters to predict RNA editing sites in these chloroplast genes. To validate the predictions, expressed sequence tag (EST) sequences of C. × morifolium, G. abyssinica, H. annuus, and L. sativa were obtained from the Compositae Genome Project Database (http://compgenomics.ucdavis.edu/), and the protein-coding genes of these four species were searched against their respective EST databases using BLAST. Significant hits were examined manually, and only C-to-T base differences were considered RNA editing sites.
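
The manual screening rule (keep only C-to-T differences between a gene and its EST hit) can be sketched in Python as follows. This is a minimal illustration rather than the procedure actually used: it assumes the genomic and EST sequences are taken from the same pairwise alignment, that both are given in the sense orientation of the gene, and the function name candidate_editing_sites is hypothetical.

```python
def candidate_editing_sites(genomic, est):
    """List 1-based gene positions where the genomic base is C and the EST base is T.

    genomic and est are two rows of one pairwise alignment (equal length,
    '-' for gaps), both in the sense orientation of the gene (an assumption
    of this sketch). Positions are counted along the ungapped genomic row.
    """
    if len(genomic) != len(est):
        raise ValueError("aligned sequences must have equal length")
    sites = []
    position = 0
    for g, e in zip(genomic.upper(), est.upper()):
        if g == "-":
            continue  # insertion in the EST; no genomic coordinate
        position += 1
        if g == "C" and e == "T":
            sites.append(position)
    return sites
```

All other mismatches are ignored by this filter, since they are more likely to reflect sequencing errors or polymorphisms than editing.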

Phylogenetic Analysis

A set of 81 cp genes from 70 taxa, comprising the eight Asteraceae species, 59 other angiosperm lineages, and three gymnosperms, was used to infer phylogenetic relationships (Online Resource 1_ESM_1 and Online Resource 1_ESM_2). The MSWAT database (http://mswat.ccbb.utexas.edu/) (Jansen et al. 2007) was used to generate an alignment of these sequences. Maximum likelihood (ML) searches for the best trees were performed with RAxML-HPC BlackBox, accessed through the CIPRES Science Gateway (Miller et al. 2010). We estimated the proportion of invariable sites (GTRGAMMA + I) and let RAxML halt bootstrapping automatically, as recommended by the online program. PAUP* (Swofford 2002) was used to conduct the maximum parsimony (MP) analysis with the parameters described above. Nuphar advena and Nymphaea alba served as outgroups in both the ML and MP searches.

Results and Discussion

Comparison of Genome Content and Organization of Asteraceae cp Genomes

GC (guanine-cytosine, G+C) content was first evaluated across the eight Asteraceae cp genomes. The GC content of the complete genomes varied little, ranging from 37.3 to 37.6 %. The GC content of the protein-coding genes was also calculated, and no significant variation was observed among the genomes. The cp genome of J. vulgaris had the lowest GC content in both the complete genome and the coding regions, whereas H. annuus and G. abyssinica had the highest GC content in the complete genome and A. adenophora had the highest GC content in the coding regions.
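
For reference, GC content as used here is the percentage of G and C among the unambiguous bases of a sequence. A minimal Python sketch follows; the function name gc_content is hypothetical, and skipping ambiguous characters such as N is an assumption of this sketch rather than a documented choice of the study.

```python
def gc_content(seq):
    """Return the GC content of seq in percent.

    Only A, C, G, and T are counted; ambiguous characters (for example N)
    and gaps are ignored, which is an assumption of this sketch.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total
```

Applied to a complete genome and to the concatenated coding sequences separately, the same function yields the two kinds of values compared in this section.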

The sizes of the eight Asteraceae cp genomes were then compared. The genomes ranged from 150,686 to 152,772 bp in length, with an average of about 151 kb (Table 1). All eight genomes contained two IR copies, each ranging from 23,755 to 25,034 bp. The IRs were separated by an LSC region (82,718 to 84,829 bp) and an SSC region of about 18 kb. The complete genome, SSC region, and IRs of L. sativa were the largest among the eight species, whereas A. adenophora had the largest LSC region and the smallest IRs. Variation in cp genome size has been attributed to length differences in the LSC and IR regions (Chung et al. 2006), whereas in the Asteraceae the size of the intergenic regions contributed most to the variation in cp genome size.

Table 1 Comparison of cpDNA features among eight species of the Asteraceae family

The gene content of the eight Asteraceae cp genomes was then compared. It was conserved, with only minor variation. The Asteraceae cp genomes contained approximately 114 unique genes, including about 80 protein-coding genes, about 29 transfer RNA (tRNA) genes, and 4 ribosomal RNA (rRNA) genes (rrn4.5, rrn5, rrn16, and rrn23, all duplicated in the IRs). H. annuus, A. adenophora, J. vulgaris, and L. sativa shared the same 81 protein-coding genes. This number was larger than that of A. frigida, which lacks the psbJ gene, and those of C. indicum, C. × morifolium, and G. abyssinica, which all lack the ycf15 gene (not duplicated in the IRs). Protein-coding genes made up 48.8 % (A. adenophora and C. indicum) to 52.1 % (A. frigida) of the genome; the remainder consisted of tRNA genes, rRNA genes, introns, intergenic spacers, and pseudogenes.

In the cp genomes of higher plants, the rps12 gene is trans-spliced, with one of its exons located in the LSC region and the other duplicated in the IRs (Howe et al. 2003). In this study, all eight Asteraceae cp genomes showed this feature. A total of 17 intron-containing genes were found in the Asteraceae cp genomes, comprising 11 protein-coding genes and six tRNA genes. Almost all were single-intron genes; the exceptions were ycf3 and clpP, each of which had two introns in A. adenophora, G. abyssinica, H. annuus, J. vulgaris, and L. sativa. In addition, rpoC1 had two introns in A. adenophora. In contrast to H. annuus, the C. × morifolium plastome contained 12 intron-containing protein-coding genes and five intron-containing tRNA genes, again giving a total of 17: it had an additional two-exon gene, ycf2, but lacked the intron-containing trnG-UCC. A. frigida and C. indicum each also had an additional two-exon gene, ycf2 and rpl16, respectively.

Furthermore, apart from the 24 shared tRNA genes (seven of them duplicated in the IRs), the Asteraceae cp genomes differed in the number, type, and relative position of their tRNA genes (Online Resource 1_ESM_3). All species except A. frigida and J. vulgaris contained two trnS-GCU genes with minor differences in length. A. adenophora, A. frigida, C. indicum, C. × morifolium, G. abyssinica, and J. vulgaris contained two trnG-UCC genes, one of which had an intron, whereas trnE-UUC was absent from A. adenophora, C. indicum, C. × morifolium, and G. abyssinica. In addition, trnG-UUC was not present between trnR-UCU and trnT-GGU in C. × morifolium.

Comparison of the Genome Structure in Asteraceae cp Genomes

Previous studies have shown that Asteraceae cp genomes share a large 22.8 kb inversion and a smaller 3.3 kb inversion nested within it (Kim et al. 2005). In our study, both inversions were found in the plastid genomes of all eight Asteraceae species relative to N. tabacum (NC_001879) (Shinozaki et al. 1986) (Fig. 1). The two inversions may therefore be present in all Asteraceae species as a characteristic feature of their cp genomes. They always appear together, suggesting that they arose during the same evolutionary process.

Fig. 1 Comparison of the genome structure of eight Asteraceae chloroplast genomes, with N. tabacum as the reference, using the Mauve program. Boxes above the line represent DNA sequences in the clockwise direction, and those below the line represent DNA sequences in the counterclockwise direction

In addition, the cp genome of A. adenophora was compared with those of the remaining seven species by dot-plot analysis using the program Mulan (http://mulan.dcode.org/) (Ovcharenko et al. 2005). With the exception of A. frigida, there were no substantial rearrangements among the eight species, whose genomes were entirely collinear apart from numerous small deletions and insertions (Fig. 2 and Online Resource 1_ESM_4). The SSC region of A. frigida is inverted relative to the other Asteraceae cp genomes. The gene order in the SSC region was therefore analyzed further (Fig. 3). Apart from A. frigida, the Asteraceae species had the same gene order in the SSC region: it began with the ycf1 pseudogene, followed by rps15, ndhH, ndhA, ndhI, ndhG, ndhE, psaC, ndhD, ccsA, trnL-UAG, and rpl32, and ended with ndhF, which is completely reversed compared with N. tabacum. In contrast, the gene order of A. frigida was the same as that of N. tabacum, ending with the ycf1 pseudogene, which extended into the IRa region. These results suggest that an inversion may have occurred in the SSC region of the Asteraceae before the divergence of the species within the family, and that the SSC region of the A. frigida lineage subsequently underwent a re-inversion that restored the same gene order as in N. tabacum. Sequence rearrangements have been shown to result generally from recombination events (Ogihara et al. 1988). The structural variation and sequence rearrangements of the Asteraceae cp genomes provide a valuable resource for molecular evolution and phylogenetic studies.
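
The gene-order comparison in the SSC region can be expressed as a simple check of whether one ordered gene list is the reverse of another. The Python sketch below uses the gene order stated above; the function name relative_orientation and the constant name are hypothetical, and the sketch deliberately ignores strand information, which a full inversion test would also compare.

```python
# SSC gene order reported above for the seven species other than A. frigida
# (ycf1 here denotes the ycf1 pseudogene at the IR/SSC border).
SSC_ORDER_MOST = [
    "ycf1", "rps15", "ndhH", "ndhA", "ndhI", "ndhG", "ndhE",
    "psaC", "ndhD", "ccsA", "trnL-UAG", "rpl32", "ndhF",
]

def relative_orientation(order_a, order_b):
    """Classify two gene orders as 'collinear', 'inverted', or 'rearranged'.

    Only the order of gene names is compared; transcription direction is
    not considered in this simplified sketch.
    """
    if list(order_a) == list(order_b):
        return "collinear"
    if list(order_a) == list(order_b)[::-1]:
        return "inverted"
    return "rearranged"

# According to the text, A. frigida (like N. tabacum) shows the reverse order.
ssc_order_frigida = SSC_ORDER_MOST[::-1]
print(relative_orientation(SSC_ORDER_MOST, ssc_order_frigida))
```

Anything other than an exact match or an exact reversal is reported as "rearranged", which would flag cases that need closer inspection.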

Fig. 2 Dot-plot comparison showing conserved and inverted regions in the A. adenophora (x axis) and A. frigida (y axis) cp genomes. Note that the cpDNA of A. frigida has an inversion in the SSC region (red arrow)

Fig. 3 Comparison of the genome structure among eight Asteraceae species. Genes above the green lines are transcribed in the forward direction, and genes below the green lines are transcribed in the reverse direction

Furthermore, the exact borders between the IR regions and the two single-copy regions (LSC and SSC) were compared among the eight Asteraceae cp genomes to investigate contraction or expansion of the IR regions (Fig. 3). In all eight species, the border between IRb and LSC lay within the rps19 gene, which produced an rps19 pseudogene at the end of IRa whose length equaled the extent of the IRb expansion into rps19. The IRb of A. adenophora, C. indicum, C. × morifolium, G. abyssinica, and H. annuus extended approximately 100 bp into the rps19 gene, whereas A. frigida, J. vulgaris, and L. sativa had 60, 41, and 60 bp of the rps19 pseudogene at the end of IRa, respectively. In addition to the expansion into rps19, the IRb region also extended into the ycf1 gene in all species except A. frigida, in which the IRa region extended into ycf1 because of the re-inversion event. This created a partial duplicate of the ycf1 gene of variable length (404 to 576 bp) at the beginning of the IRa region (at the end of the IRb region in A. frigida). These incomplete duplicates of rps19 and ycf1 lack protein-coding ability. The trnH gene was located entirely within the LSC region, at varying distances from the IRa/LSC border: C. indicum had the longest intergenic spacer among these species (26 bp), whereas H. annuus had only 2 bp. In all species except A. frigida, the ndhF gene was located 0 to 209 bp upstream of the IRa/SSC border, whereas in A. frigida it was located 75 bp downstream of the IRb/SSC border. Contraction or expansion of the IR regions may result from intramolecular recombination between two short direct repeat sequences within the genes located at the borders between the IR regions and the two single-copy regions (Maier et al. 1995).

Sequence Divergence in Asteraceae cp Genomes and Marker Identification

The complete Asteraceae cp genomes made it possible to analyze sequence variation within the family at the whole cp genome level. Regions with high sequence variation among the eight species were identified and visualized using mVISTA (http://genome.lbl.gov/vista/mvista/) (Frazer et al. 2004). The results showed that non-coding regions were more divergent than coding regions. The coding regions with the highest nucleotide divergence among the eight genomes were scattered across the genome and included ycf1, ndhK, rps16, rps3, rpl22, ccsA, matK, rpoC1, and accD (Fig. 4). Some intergenic regions with high sequence variation were also found.

Fig. 4 Sequence alignment of eight sequenced cp genomes in the Asteraceae family. Sequences of cp genomes were aligned and compared using the mVISTA program. A cutoff of 70 % identity was used for the plot, and the Y-scale represents the percent identity ranging from 50 to 100 %. Blue represents exons, green-blue represents tRNA and rRNA genes, and pink represents conserved non-coding sequences (CNS). Grey arrows indicate the direction of transcription; horizontal blue lines indicate the position of IRa and IRb; horizontal red lines indicate the position of Inv1 and Inv2

To identify new regions that could be applied to phylogenetic analysis of the Asteraceae, eight intergenic regions with high sequence diversity, together with their combined sequence, were extracted from these genomes and analyzed using the maximum parsimony (MP) method (Table 2 and Fig. 5). In each of the eight markers, more than 5 % of the characters were parsimony-informative. MP analysis of each region yielded a single tree, with consistency indices ranging from 0.9020 to 1 and retention indices ranging from 0.8148 to 1. Analysis of the combined sequences of all eight regions generated a congruent topology with high support for four internal nodes. Seven regions (ccsA-trnL, psbI-trnS, rpl33-rps18, trnF-ndhJ, trnG-trnT, trnH-psbA, and trnT-trnL) produced trees fully congruent with the evolutionary history of the species. Among them, trnH-psbA has frequently been applied as a phylogenetic marker in the Asteraceae (Doorduin et al. 2011), and psbI-trnS was also identified by Nie et al. (2012). The remaining regions are newly identified candidates for developing molecular markers for phylogenetic analysis in the Asteraceae.
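
The proportion of parsimony-informative characters used to screen the markers can be computed directly from an alignment. A site is parsimony-informative when at least two different character states each occur in at least two taxa. The Python sketch below is illustrative only: the function name is hypothetical, and treating gaps and ambiguous characters as missing data is an assumption of this sketch, not a statement about the settings used in the analysis.

```python
from collections import Counter

def parsimony_informative_fraction(alignment):
    """Return the fraction of alignment columns that are parsimony-informative.

    alignment is a list of equal-length aligned sequences. A column is
    informative if at least two different bases each occur in at least two
    sequences. Gaps and ambiguous characters are treated as missing data
    (an assumption of this sketch).
    """
    if not alignment or len(alignment[0]) == 0:
        raise ValueError("alignment must contain at least one column")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    rows = [seq.upper() for seq in alignment]
    informative = 0
    for column in zip(*rows):
        counts = Counter(base for base in column if base in "ACGT")
        states_seen_twice = sum(1 for n in counts.values() if n >= 2)
        if states_seen_twice >= 2:
            informative += 1
    return informative / length
```

A region would pass the screen described above if this fraction exceeded 0.05.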

Table 2 Promising regions identified for molecular phylogenetic studies of Asteraceae by comparison of the full cp genomes of the eight species

Fig. 5 Maximum parsimony (MP) trees of the eight selected cp intergenic regions. The phylogram labeled “combined regions” in the middle is derived from MP analysis of all eight regions together

Prediction of the RNA Editing Sites in the Asteraceae cp Genomes

RNA editing is one of the most important post-transcriptional processes in eukaryotic organisms; it alters transcripts through nucleotide insertion, deletion, or substitution and thereby enriches the genetic information. Identifying RNA editing sites in chloroplasts not only provides vital information on the proper function of plastid-encoded proteins but also reveals evolutionary features of RNA editing (Tillich et al. 2006; Nie et al. 2014). To investigate RNA editing in Asteraceae plastids, we systematically analyzed and compared the RNA editing sites in the eight Asteraceae cp genomes using a computational approach (Table 3). A total of 373 editing sites were predicted in these cp genomes, an average of 47 sites per species. All the editing sites were C-to-U conversions, consistent with previous observations in seed plant plastids (Tillich et al. 2006; Chen et al. 2011). Specifically, 42 editing sites in 19 genes were identified in A. adenophora, 50 sites in 21 genes in A. frigida, 49 sites in 21 genes in each of C. indicum and C. × morifolium, 45 sites in 19 genes in G. abyssinica, 46 sites in 20 genes in H. annuus, 44 sites in 20 genes in J. vulgaris, and 48 sites in 19 genes in L. sativa. Of these, 11 sites in G. abyssinica and 6 sites in H. annuus were validated by EST alignment analysis. We also compared the patterns of RNA editing sites among these plastids and found that 26 sites in 12 genes were shared by all eight Asteraceae cp genomes. The number of shared editing sites has been reported to increase in closely related taxa (Chen et al. 2011). In this study, A. frigida shared more editing sites with C. indicum and C. × morifolium than with the other species, and A. adenophora shared more editing sites with G. abyssinica and H. annuus, suggesting that RNA editing is evolutionarily conserved.
Although the Asteraceae plastids appeared to have similar patterns of RNA editing, some species-specific editing sites were also found, such as the rpoC2-2 site (identified only in L. sativa), which suggests that some specific evolutionary features of RNA editing are present at the genus and subfamily levels within the Asteraceae.
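
The totals quoted above follow directly from the per-species counts, and shared sites correspond to a set intersection across species. The Python sketch below reproduces the arithmetic from the counts given in the text; the helper shared_sites and the site labels in its usage example are hypothetical, since the individual site lists are reported only in Table 3.

```python
from statistics import mean

# Predicted editing sites per species, as stated in the text.
PREDICTED_SITES = {
    "A. adenophora": 42,
    "A. frigida": 50,
    "C. indicum": 49,
    "C. x morifolium": 49,
    "G. abyssinica": 45,
    "H. annuus": 46,
    "J. vulgaris": 44,
    "L. sativa": 48,
}

TOTAL_SITES = sum(PREDICTED_SITES.values())   # total across the eight species
MEAN_SITES = mean(PREDICTED_SITES.values())   # average per species

def shared_sites(site_sets):
    """Return the editing sites present in every species' set of sites."""
    site_sets = [set(s) for s in site_sets]
    if not site_sets:
        return set()
    return set.intersection(*site_sets)

# Usage with made-up site labels (hypothetical, for illustration only):
example = shared_sites([{"geneA-1", "geneB-2"}, {"geneA-1"}, {"geneA-1", "geneC-3"}])
print(TOTAL_SITES, round(MEAN_SITES), example)
```

Pairwise intersections computed in the same way would give the numbers of sites shared between any two species, which is the comparison underlying the statement that closely related taxa share more sites.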

Table 3 RNA editing sites found in the chloroplast genomes of the Asteraceae family

Phylogenetic Analysis

First, 81 genes were extracted from the cp genomes of A. adenophora, A. frigida, C. indicum, C. × morifolium, G. abyssinica, and J. vulgaris and uploaded to the MSWAT database for sequence alignment. After removal of gaps, a total of 62,531 characters remained in the final dataset. MP analysis generated a single tree with a length of 169,196, a consistency index of 0.4081, and a retention index of 0.6023 (Fig. 6). Bootstrap analysis indicated that 54 of the 67 nodes were supported by values of at least 95 %, and 45 of these had bootstrap values of 100 %. ML analysis of the dataset produced a topology similar to that of the MP tree (Online Resource 1_ESM_5). A. frigida, C. indicum, and C. × morifolium fell into the tribe Anthemideae of the subfamily Asteroideae. A. adenophora, G. abyssinica, and H. annuus clustered in the Heliantheae alliance of Asteroideae. Of the remaining two species, J. vulgaris grouped in the tribe Senecioneae of Asteroideae, and L. sativa was placed in the tribe Cichorieae of the subfamily Cichorioideae. The phylogeny obtained from the molecular data is thus comparable to the taxonomy based on phenotypic characteristics. The eight Asteraceae species clustered within the Asterales and were placed within the euasterids II, which supports the monophyly of the Asteraceae.

Fig. 6 Phylogenetic tree reconstruction of 70 taxa using maximum parsimony (MP) based on concatenated sequences from 81 cp genes. The position of the Asteraceae family is indicated by a red box