Introduction

A characteristic feature of the circular plastid genome is its quadripartite structure, with the presence of two large inverted repeats (IRs) that separate two single-copy regions: the large single-copy (LSC) and the small single-copy (SSC) region. The IRs can vary in size even in closely related species or, in rare cases, be entirely absent (Guisinger et al. 2011; Wicke et al. 2011). The complete size of plastid genomes typically varies between 120 and 160 kbp (Wicke et al. 2011), but can range from 85 kbp in the parasitic Cuscuta (McNeal et al. 2007) to 242 kbp in Pelargonium (Weng et al. 2014). Size differences are usually the result of expansions or contractions of the IRs and, thus, the duplication of coding and non-coding regions, but are rarely due to changes in gene content, which is relatively stable in land plants (Wicke et al. 2011; Jansen and Ruhlman 2012). Strong reductions in gene content are primarily known from the plastomes of plants with parasitic (McNeal et al. 2007; Wicke et al. 2013) or mycoheterotrophic lifestyles (Logacheva et al. 2011). Considerable alterations in gene content were recently also reported from the carnivorous Lentibulariaceae (Wicke et al. 2014). By contrast, the gene order of plastid genomes in specific land plant lineages appears to be more variable. For example, extensive rearrangements were reported from the plastid genomes of Campanula (Campanulaceae, Haberle et al. 2008), Trifolium (Fabaceae, Cai et al. 2008) and Pelargonium (Geraniaceae, Weng et al. 2014). It is currently unclear if genomic rearrangements occur stochastically throughout the angiosperms or if certain lineages display a higher propensity for such rearrangements (Weng et al. 2014), which might be connected to idiosyncrasies in their DNA replication and repair systems (Zhang et al. 2016).

Since the first complete plastid genome sequences were applied to plant phylogenetics approximately 15 years ago, plastid phylogenomics has become widely used to resolve phylogenetic relationships among photosynthetic eukaryotes, including land plants (Ruhfel et al. 2014) and green algae (Sun et al. 2016). The increased availability of complete plastid genomes considerably increased the character base for phylogenetic analyses of disputed evolutionary relationships, including the early-diverging angiosperms. An active discussion has developed on the question of whether Amborella constitutes the sister to all other angiosperms (e.g., Soltis et al. 1999; Borsch et al. 2003; Leebens-Mack et al. 2005; Müller et al. 2006; Jansen et al. 2007; Moore et al. 2007, 2011; Drew et al. 2014) or whether it forms a clade with the Nymphaeales (e.g., Goremykin et al. 2003a, 2004, 2013). Even though the sequencing of complete plastid genomes has dramatically increased the number of genes under study (Goremykin et al. 2003a, 2004), a dense taxon sampling and the choice of methodology were found to be similarly important for accurate phylogeny inference in early-diverging angiosperms (Jansen et al. 2007; Moore et al. 2007). In an attempt to increase taxon sampling in the Nymphaeales, two species of the family Hydatellaceae, Trithuria inconspicua (Goremykin et al. 2013) and Trithuria filamentosa (Drew et al. 2014), have recently been added to the previously sequenced plastid genomes of Nuphar advena (Raubeson et al. 2007) and Nymphaea alba (Goremykin et al. 2003a). This addition improved the representation of species of the Nymphaeales among the available plastid genomes of early-diverging angiosperms, but the family Cabombaceae has yet to be included.

The plant order Nymphaeales plays a pivotal role in the discussion on the evolutionary history of early-diverging angiosperms. A dense sampling of plastid genomes from this order would allow hypotheses concerning gene order and size of inverted repeats in the ancestor of all extant angiosperms to be tested. However, the phylogenetic relationships within the Nymphaeales are only partially understood. More than half of the diversity of the Nymphaeales can be found in the genus Nymphaea (Schneider and Williamson 1993; Borsch et al. 2008, 2011). The stem of the Nymphaeales is estimated to have diverged 108.6 Ma (±25.2 Ma), whereas the major diversifications occurred in the Eocene and the middle Miocene (Löhne et al. 2008; Iles et al. 2014). In its current circumscription, the Nymphaeales include the families Nymphaeaceae, Cabombaceae and Hydatellaceae, the latter of which was identified as part of the water lily clade only recently (Saarela et al. 2007).

The only direct comparison of plastid genomes of the Nymphaeales has been conducted by Raubeson et al. (2007), who compared the sequence similarity of genes, introns and intergenic regions in the plastomes of Nuphar advena and Nymphaea alba. They reported that more than two-thirds of the homologous regions are at least 95% identical and that only 3% of the regions are less than 70% identical. They also showed that the IR boundaries of Nuphar differed from those of Nymphaea and Amborella in that the ndhF gene spans across the IR/SSC boundary. To our knowledge, no subsequent investigation has specifically compared gene order and IR size of early-diverging angiosperms since then. The recently sequenced plastid genomes of Trithuria add an interesting new component to the structural diversity of the plastid genomes of early-diverging angiosperms. These genomes differ substantially in sequence length, displaying 5–20 kbp longer genome sequences than other early-diverging angiosperms, which may be indicative of major structural changes or of IR expansions.

In the present investigation, we generate complete and annotated plastid genomes of seven members of the Nymphaeales. Six of these genomes represent species that have not been sequenced before, including members of the Cabombaceae. We combine the DNA sequences of the novel plastid genomes with those of eight previously published plastid genomes of early-diverging angiosperms, generating a dataset that comprises the Nymphaeales with all major subclades and other early-diverging angiosperms (Amborella, Austrobaileyales). Upon careful review and, where necessary, correction of the gene annotations in each plastid genome under study, we evaluate and compare gene content, gene synteny and IR boundaries. We also assess the presence of translation initiation and translation termination codons in several open reading frames and hypothetical protein-coding genes, which are not consistently annotated across early-diverging angiosperm plastomes. Furthermore, we generate a multi-gene alignment of 77 plastid-encoded genes of the 15 plastid genomes under study and reconstruct the phylogenetic relationships between these taxa. Finally, we infer the phylogenetic placement of a previously published plastid genome that was designated as “Nymphaea mexicana” by the original authors, but which appears to be incorrectly determined based on a DNA sequence comparison with other records of Nymphaea.

Materials and methods

Taxon sampling and combination with previously published genomes

The plastid genomes of five species of the Nymphaeaceae and two species of the Cabombaceae were newly sequenced for this investigation. Specifically, we generated complete plastid genomes of the genera Barclaya, Nymphaea (N. alba, N. ampla, N. jamesoniana) and Victoria of the Nymphaeaceae and of the genera Brasenia and Cabomba of the Cabombaceae. These seven plastid genomes were combined with eight previously published plastid genomes of various early-diverging angiosperm families, generating a dataset of 15 plastid genomes that represent all currently recognized orders of early-diverging angiosperms. Species name, taxonomic position and GenBank accession number of each of the 15 records as well as sampling location and herbarium voucher information for each of the newly sequenced plastid genomes are presented as electronic supplementary material (Online Resource 1).

DNA extraction and genome sequencing

All plastid genomes sequenced for this investigation were generated from young leaves taken from live plant specimens cultivated at the Botanical Garden and Botanical Museum Berlin. Prior to DNA isolation, the edges of each leaf sample were removed with sterile razor blades, and the cut leaves were rinsed with deionized water and 70% ethanol. Total genomic DNA was isolated from 1.5 g of cleaned leaf material using the NucleoSpin Plant II kit (Macherey–Nagel, Düren, Germany). For each DNA extraction, a barcoded genomic library was constructed using the Nextera DNA library preparation kit (Illumina, San Diego, CA, USA). Finished libraries were quantified using a PicoGreen dsDNA quantitation kit (Invitrogen, Carlsbad, CA, USA). The libraries were pooled and sequenced as paired-end reads (MiSeq reagent kit, v3 chemistry, 600 cycles) on an Illumina MiSeq system at the Berlin Center for Genomics in Biodiversity Research with an insert size of 250–300 bp.

Genome assembly and annotation

A minimum of 1.6 million paired-end reads were generated per sample. Raw reads were trimmed by quality via FASTX Toolkit v.0.0.14 (Gordon 2014), using a minimum quality score of 30. Between 2.61 and 4.59% of all quality-filtered reads per sample were found to be of chloroplast origin, as they mapped to either of the two previously published plastid genomes Nymphaea alba (NC_006050; Goremykin et al. 2004) and Nuphar advena (NC_008788; Raubeson et al. 2007), which were used as reference genomes. Trimmed and quality-filtered reads were assembled de novo into contigs using Velvet v.1.2.10 (Zerbino and Birney 2008), testing a range of kmer values to optimize contig length (kmer = 33–97, in increments of 4). Back-mapping of successfully assembled reads to the complete plastid genomes indicated a mean coverage depth greater than 150 for each newly generated plastid genome, with more than 98.5% of all bases covered by a depth of 50 or greater. Individual contigs were combined manually into final assemblies with the help of the software application Geneious v.7.1.9 (Kearse et al. 2012), using the reference genomes as guides for contig position and orientation. Adjacent, overlapping contigs were combined only if less than 10% of the overlapping nucleotides were dissimilar. At a minimum, between two and five contigs had to be combined in this way to cover half of each plastid genome. Read, contig and assembly statistics across newly generated plastid genomes were calculated via QUAST v.4.5 (Gurevich et al. 2013) and are provided as electronic supplementary material (Online Resource 2).
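
The numeric criteria of this assembly step can be illustrated with a minimal Python sketch. It is not the pipeline used here (assembly was performed in Velvet and Geneious); the function names and the toy inputs are hypothetical, and the sketch assumes that the kmer range is inclusive at both ends and that overlapping contig ends have already been aligned to equal length.

```python
def kmer_sweep(start=33, stop=97, step=4):
    """Return the kmer sizes to test, assuming an inclusive range."""
    return list(range(start, stop + 1, step))


def coverage_summary(depths, threshold=50):
    """Return mean depth and the fraction of bases at or above threshold."""
    if not depths:
        raise ValueError("no per-base depths given")
    mean_depth = sum(depths) / len(depths)
    covered = sum(1 for d in depths if d >= threshold) / len(depths)
    return mean_depth, covered


def overlap_is_compatible(end_a, end_b, max_mismatch=0.10):
    """True if fewer than max_mismatch of the overlapping bases differ.

    Assumes both overlap strings are already aligned and of equal length.
    """
    if not end_a or len(end_a) != len(end_b):
        raise ValueError("overlaps must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(end_a.upper(), end_b.upper()))
    return mismatches / len(end_a) < max_mismatch


if __name__ == "__main__":
    print(kmer_sweep())
    print(coverage_summary([160, 155, 48, 170]))
    print(overlap_is_compatible("ACGTACGTACGT", "ACGTACGTACGA"))
```

Keeping the thresholds as named parameters makes it easy to see how sensitive the contig-joining decision is to the 10% criterion.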

General genome structure and ambiguous nucleotide positions were evaluated through an additional assessment. The quadripartite structure of the final assemblies and the equality of the inverted repeats were confirmed by self-blasting each assembly using the BLAST+ suite v.2.4.0 (Camacho et al. 2009). Single nucleotide ambiguities were resolved by mapping the trimmed reads against the regions with ambiguities using the short read mapper bowtie v.1.1.2 (Langmead et al. 2009) and assigning those nucleotides found by majority rule across the mapped reads.
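
The two checks described above can be sketched in a few lines of Python. This is an illustration only, not the BLAST+ or bowtie workflow itself: the function names are hypothetical, the IR test assumes that both repeat copies have already been extracted as strings, and "majority rule" is interpreted here as a strict majority (more than half) of the unambiguous bases observed at a position.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def irs_are_equal(irb, ira):
    """True if IRA is the exact reverse complement of IRB."""
    return reverse_complement(irb).upper() == ira.upper()


def majority_base(pileup):
    """Resolve one ambiguous position from the bases of the mapped reads.

    Returns the base seen in more than half of the unambiguous read
    bases, or 'N' if no base reaches a strict majority (an assumption).
    """
    counts = Counter(b.upper() for b in pileup if b.upper() in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "N"
    base, n = counts.most_common(1)[0]
    return base if 2 * n > total else "N"


if __name__ == "__main__":
    print(irs_are_equal("ATGC", "GCAT"))
    print(majority_base("AAAAGAAn"))
```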

Gene annotation of the assembled plastid genomes was carried out in a two-step procedure. First, we added raw annotations as predicted by the annotation servers DOGMA (Wyman et al. 2004) and cpGAVAS (Liu et al. 2012), selecting Nymphaea alba (accession NC_006050) as reference genome. Second, we inspected and curated all gene annotations and annotation names manually using Geneious. During this curation, we extracted all coding regions per genome, confirmed start and stop codons for each gene, aligned the extracted regions across all study taxa, confirmed approximate gene lengths based on their amino acid translations and reconfirmed any internal stop codons. This confirmation was carried out for all 15 input taxa, thus confirming the annotations of the newly generated as well as the previously published plastomes. The complete plastid genome sequences of all newly sequenced genomes are available from GenBank (Online Resource 1).

Inference and visualization of gene synteny

To visualize gene synteny across all 15 plastid genomes under study, the genes were listed by their location in the genome in the following partition order: LSC, inverted repeat B (IRB), SSC, inverted repeat A (IRA). In cases where genes have multiple exons separated by one or more coding sequences (e.g., trnK, rps12), the exons instead of the full genes were used in our synteny comparisons, with “p1” and “p2” appended to their names. Circular and linear genome maps were generated via OGDRAW v.1.2 (Lohse et al. 2013; Online Resource 3).
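
The ordering rule described above can be sketched as follows in Python. The feature tuples and coordinates are invented for illustration, the function names are hypothetical, and the underscore used to append "p1" and "p2" is an assumption about the naming format.

```python
# Partition order used for listing genes: LSC, IRB, SSC, IRA.
PARTITION_ORDER = ("LSC", "IRB", "SSC", "IRA")


def exon_labels(gene, n_parts=2):
    """Return per-exon names such as 'trnK_p1' and 'trnK_p2'."""
    return [f"{gene}_p{i}" for i in range(1, n_parts + 1)]


def synteny_list(features):
    """Order features by partition, then by start coordinate.

    features: iterable of (name, partition, start) tuples, where
    partition is one of PARTITION_ORDER.
    """
    rank = {p: i for i, p in enumerate(PARTITION_ORDER)}
    ordered = sorted(features, key=lambda f: (rank[f[1]], f[2]))
    return [name for name, _, _ in ordered]


if __name__ == "__main__":
    p1, p2 = exon_labels("trnK")
    toy = [
        ("ndhF", "SSC", 110000),   # coordinates are made up
        (p2, "LSC", 1500),
        ("rrn16", "IRB", 100000),
        ("matK", "LSC", 2000),
        (p1, "LSC", 4300),
        ("rrn16", "IRA", 140000),
    ]
    print(synteny_list(toy))
```

Listing split genes exon by exon keeps a gene such as matK, which sits between the two trnK exons, at its true position in the synteny comparison.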

Multi-gene alignment and manual alignment correction

To infer the phylogenetic relationships among the newly sequenced plastid genomes, a multi-gene alignment was generated. Upon bioinformatic extraction of the full gene complement from each plastid genome under study, a gene-by-gene alignment was conducted, whereby each gene was translated into a sequence of amino acids, the amino acids aligned under the scoring matrix BLOSUM62 using MAFFT v.7.304b, and the aligned amino acid sequences back-translated to nucleotide sequences. The resulting DNA sequence alignments were manually adjusted following the rules of Löhne and Borsch (2005) using PhyDE v.0.9971 (Müller et al. 2007) to assure positional homology, following the understanding that alignment at the genome level must not be less rigorous than for individual genomic regions. Small inversions were separated in specific columns to prevent incorrect homology assessment. Upon manual alignment adjustment, the gene-wise alignments were concatenated in the same gene order as found in the actual genomes. The hypothetical protein-coding gene ycf1 was not included in the multi-gene alignment due to large regions of uncertain homology. Ribosomal and transfer RNA genes were also not included in the alignment given the uncertainty associated with accounting for their secondary structure (Michaud et al. 2011). Moreover, we excluded several sections of the genes accD, ndhF, rpoC1 and rpoC2 from the multi-gene alignment due to considerable uncertainty about the correct positional homology. The length of the exclusions was hereby set to a multiple of 3 bp, ensuring preservation of the reading frame.
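
The back-translation step of the gene-by-gene alignment can be illustrated with a minimal Python sketch. It assumes that the amino acid alignment has already been produced (for example by MAFFT) and that each residue corresponds to exactly one codon of the unaligned coding sequence; the function name is hypothetical.

```python
def back_translate(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its aligned protein.

    Every gap in the protein alignment becomes a three-nucleotide gap,
    so the resulting nucleotide alignment keeps the reading frame.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("residue count does not match codon count")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)


if __name__ == "__main__":
    print(back_translate("M-K", "ATGAAA"))
```

Because gaps are always inserted in units of three nucleotides, the same logic explains why excluded sections were set to a multiple of 3 bp.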

Phylogenetic inference and data partitioning

Phylogenetic reconstructions were performed on the complete multi-gene alignment via maximum likelihood (ML) and Bayesian (BI) phylogeny inference. Analyses via ML were conducted with RAxML v.8.2.9 (Stamatakis 2014) using the best-fitting nucleotide substitution model GTR + G + I and the thorough ML optimization option. Branch support for ML analyses was calculated via 1000 bootstrap (BS) replicates using the rapid BS algorithm (Stamatakis et al. 2008) and the same nucleotide substitution models as under tree inference. Analyses via BI were conducted with MrBayes v.3.2.5 (Ronquist and Huelsenbeck 2003) under the best-fitting nucleotide substitution model as inferred by jModeltest v.2.1.7 (Darriba et al. 2012), using four parallel Markov Chain Monte Carlo (MCMC) runs for a total of 20 million generations. Independent sampling of generations and convergence of Markov chains were confirmed in Tracer v.1.6 (Rambaut et al. 2014). The initial 50% of all MCMC trees were discarded as burn-in, and post-burn-in trees were summarized as a majority rule consensus tree, with branch support given as posterior probability (PP) values. The complete multi-gene alignment as well as the optimal phylogenetic trees inferred under ML and BI are available at Zenodo (https://zenodo.org/record/377039/).

To increase the accuracy of the inference of phylogenetic tree topology, branch lengths and substitution model parameters, we conducted our phylogenetic analyses under different data partitioning strategies. Specifically, we compared the results of four different partitioning strategies under both ML and BI phylogeny inference, with each strategy applied to the multi-gene DNA alignment. First, we conducted phylogenetic analyses on an unpartitioned matrix in which the entire multi-gene DNA alignment was analyzed under the nucleotide substitution model GTR + I + G; only a single partition was analyzed under this strategy. Second, we conducted phylogenetic analyses on a partitioned matrix in which each of the 77 genes of the alignment was analyzed under its best-fitting nucleotide substitution model as inferred by jModeltest; a total of 77 partitions were analyzed under this strategy. Third, we conducted phylogenetic analyses on a partitioned matrix in which each of the three codon positions across the alignment was grouped into its own partition; a total of three partitions were analyzed under this strategy. Fourth, we conducted phylogenetic analysis on a partitioned matrix that was inferred as the best-fitting partitioning strategy via PartitionFinder2 (Lanfear et al. 2016). Specifically, the software inferred a partitioning strategy with 18 different partitions as optimal given the multi-gene DNA alignment. A list of the individual partitions and their best-fitting nucleotide substitution models of the four data partitioning strategies is given as electronic supplementary material (Online Resource 4).
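
The by-gene and by-codon-position strategies can be made concrete with a small Python sketch that derives partition boundaries from the gene lengths of a concatenated alignment. The gene lengths below are invented, the function names are hypothetical, and the output lines follow the RAxML-style partition syntax ("DNA, name = start-end", with "\3" selecting every third site) as we understand it. The by-codon scheme assumes that every gene block has a length that is a multiple of 3 and starts in frame.

```python
def gene_partitions(gene_lengths):
    """One partition per gene: returns lines like 'DNA, atpA = 1-6'."""
    lines, start = [], 1
    for name, length in gene_lengths:
        if length % 3 != 0:
            raise ValueError(f"{name}: length is not a multiple of 3")
        end = start + length - 1
        lines.append(f"DNA, {name} = {start}-{end}")
        start = end + 1
    return lines


def codon_partitions(total_length):
    """Three partitions, one per codon position across the alignment."""
    if total_length % 3 != 0:
        raise ValueError("alignment length is not a multiple of 3")
    return [f"DNA, pos{i} = {i}-{total_length}\\3" for i in (1, 2, 3)]


if __name__ == "__main__":
    toy = [("atpA", 6), ("rbcL", 9)]  # made-up lengths
    print("\n".join(gene_partitions(toy)))
    print("\n".join(codon_partitions(sum(n for _, n in toy))))
```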

Evaluation of hypothetical genes ycf15 and ycf68

To investigate the presence and the DNA sequence conservation of the hypothetical protein-coding genes ycf15 and ycf68 in the plastid genomes of early-diverging angiosperms, we extracted the sequence of ycf15 from the plastid genome of Nicotiana tabacum (accession Z00044, Shinozaki et al. 1986) and the sequence of ycf68 from the plastid genome of Trithuria inconspicua and aligned them as baits to each of the 15 plastid genome sequences under study. We then extracted the best region that our baits aligned to from each genome, saved the extracted regions in gene-specific sequence sets and aligned each sequence set using MAFFT. The resulting alignments of ycf15 and ycf68 were compared for overall length, internal stop codons and shifts in reading frames. The reading frame of the bait sequences was hereby used as reference for the transcribed exons in the plastid genomes. The alignments of ycf15 and ycf68 are available as electronic supplementary material (Online Resources 5 and 6).
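
A minimal Python sketch of the reading-frame checks is given below. It is illustrative only: the function names are hypothetical, the toy sequence is invented, gaps are stripped from an aligned sequence before codons are read, and ATG and GTG are both accepted as potential start codons, which is an assumption of this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG"}  # GTG treated as an alternative start


def codons_of(seq, frame=0):
    """Split an ungapped, upper-cased copy of seq into complete codons."""
    clean = seq.replace("-", "").upper()[frame:]
    return [clean[i:i + 3] for i in range(0, len(clean) - len(clean) % 3, 3)]


def has_start(seq, frame=0):
    """True if the first codon in the given frame is a start codon."""
    codons = codons_of(seq, frame)
    return bool(codons) and codons[0] in START_CODONS


def internal_stops(seq, frame=0):
    """Return zero-based codon indices of stops before the last codon."""
    codons = codons_of(seq, frame)
    return [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]


def frame_preserved(seq):
    """True if the ungapped length is a multiple of 3."""
    return len(seq.replace("-", "")) % 3 == 0


if __name__ == "__main__":
    toy = "ATG---TAAGGGTGA"  # made-up sequence with one internal stop
    print(has_start(toy), internal_stops(toy), frame_preserved(toy))
```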

Test of taxon designation of GenBank record NC_024542

Doubts about the taxon designation of a previously published plastid genome arose given its sequence similarity to other DNA sequence records of Nymphaea. In a preliminary investigation, we found that the sequences of petD, rpl16, trnK-matK and trnT–trnF of the plastid genome with GenBank accession number NC_024542 (“Nymphaea mexicana”; Yang et al. 2014) were identical to previously published DNA sequences of Nymphaea odorata. We therefore decided to evaluate the taxon designation of this plastid genome as part of a genus-wide alignment. Specifically, we extracted the intergenic plastid spacer trnT–trnF from the 15 plastid genomes under study and included them in a previously published alignment of this marker (Borsch et al. 2011). We also added DNA sequences of species of Nymphaea subg. Nymphaea (Borsch et al. 2014), of Nuphar (Soininen et al. 2009) and of species outside the Nymphaeales (Borsch et al. 2003) to the alignment. We then reconstructed the phylogenetic relationships among the input sequences under both ML and BI. Phylogeny inference was conducted under the same settings as the inference of relationships among the complete plastid genomes using the multi-gene alignment, except that indels were coded according to the “Simple Indel Coding” scheme (Simmons and Ochoterena 2000) using SeqState v.1.40 (Müller 2005). DNA sequences and indel coding were combined into a partitioned dataset and analyzed with unlinked parameters. The indel partition was analyzed under the binary substitution model BINGAMMAI in RAxML and under the binary character model (Lewis 2001) in MrBayes. Based on these phylogenetic inferences, we refer to the plastid genome of accession NC_024542 as “Nymphaea cf. odorata” for the remainder of this manuscript.
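
The idea behind simple indel coding can be sketched in Python as follows. This is a simplified illustration and not a reimplementation of SeqState: each gap with a distinct pair of start and end positions becomes one binary character, a sequence carrying exactly that gap is scored 1, a sequence with a longer gap that fully spans it is scored as inapplicable ("?"), and all other sequences are scored 0. Terminal gaps are treated like internal gaps here, which is an assumption of this sketch; the function names and the toy alignment are hypothetical.

```python
import re


def gap_runs(seq):
    """Return (start, end) for every run of gap characters in seq."""
    return [(m.start(), m.end()) for m in re.finditer(r"-+", seq)]


def simple_indel_coding(alignment):
    """Code gaps of an alignment as binary characters.

    alignment: dict mapping sequence name to aligned sequence.
    Returns the sorted list of gap characters and, per sequence,
    a string of '1', '0' and '?' states in the same order.
    """
    runs = {name: set(gap_runs(seq)) for name, seq in alignment.items()}
    characters = sorted(set().union(*runs.values()))
    coded = {}
    for name, own in runs.items():
        states = []
        for start, end in characters:
            if (start, end) in own:
                states.append("1")
            elif any(gs <= start and ge >= end for gs, ge in own):
                states.append("?")  # spanned by a longer gap
            else:
                states.append("0")
        coded[name] = "".join(states)
    return characters, coded


if __name__ == "__main__":
    toy = {"a": "ACGTACGT", "b": "AC--ACGT", "c": "A----CGT"}
    print(simple_indel_coding(toy))
```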

Results

Genome structure and length of inverted repeats

The general genome structure of the 15 plastid genomes of early-diverging angiosperms under study is highly conserved. All plastid genomes analyzed display a typical quadripartite genome organization, with IR regions separating the LSC from the SSC (Fig. 1). The complete length of the plastid genomes ranges from 147,772 bp in Schisandra chinensis to 180,562 bp in Trithuria filamentosa, with all genomes of the Nymphaeaceae displaying a length between 158,360 and 160,866 bp (Table 1). The interquartile range (IQR) of the length of the LSC is between 88,737 and 90,199 bp, of the SSC between 18,822 and 19,562 bp and of the IR between 25,144 and 26,243 bp. The length variability of the three regions hereby differs markedly, with the length of the IR region, and thus the number of genes contained in them, being most variable. Specifically, the IR regions display a greater standard deviation (SD) in sequence length (SD = 8865 bp) than the LSC (SD = 6742 bp) or the SSC (SD = 4623 bp). Genome maps of each newly sequenced plastid genome are available as electronic supplementary material (Online Resource 3).

Fig. 1

Comparison of genome structure and inverted repeat (IR) length among early-diverging angiosperms visualized via aligned linear genome maps. Genes are shown proportional to length and in direction of transcription (right: forward, left: reverse). Gene color indicates functional classification. IR regions are indicated by dark gray shading, and large single-copy region (LSC) and small single-copy region (SSC) are indicated by light gray shading. Species whose plastomes differ only marginally in length are represented by the same gene map: Nymphaea ampla, N. jamesoniana, N. mexicana, Victoria cruziana, Brasenia schreberi and Barclaya longifolia are represented by N. alba; Illicium oligandrum is represented by Schisandra chinensis

Table 1 Comparison of the plastid genomes of early-diverging angiosperms

The higher variability in length of the IR than of the LSC or the SSC across our study taxa is primarily the result of differential expansions of the IR regions in the genera Cabomba, Nuphar and Trithuria as well as its contractions in the two species of the Austrobaileyales under study when compared to Amborella (Fig. 1). In the two species of Trithuria, the IRA appears to have expanded into the SSC region compared to Amborella, integrating the first eight adjacent genes. In Trithuria filamentosa, an additional expansion is found, whereby the IRB appears to have expanded into the LSC, integrating the first 20 genes adjacent to the IR. In the genus Cabomba, a similar expansion of the IRB into the LSC is inferred when compared to Amborella or Nymphaea, whereby the first 9 genes adjacent to the IRB were integrated. The plastid genome of Nuphar advena is the only genome under study that displays an expansion of the IRA into the LSC compared to Amborella or Nymphaea, with the transfer RNA gene for histidine integrated into the IR. The IR regions of the two species of the Austrobaileyales under study, by contrast, display a contraction compared to the IRs of Amborella, as the IR regions of Illicium oligandrum and Schisandra chinensis no longer contain the first five genes of the IRB adjacent to the LSC.

Re-evaluation and correction of annotations

A re-evaluation of gene annotations of the plastid genomes under study was found to be very important toward a correct genomic characterization. Multiple potential annotation errors were identified through this curation procedure. For example, the GenBank record of the plastid genome of Trithuria inconspicua (accession NC_020372) appears to confuse the two transfer RNA genes for glycine (i.e., trnG-TCC and trnG-GCC) and lacks an annotation for gene clpP despite the presence of a translation initiation and a translation termination codon in its location (Table 2). Moreover, this record exhibits IR regions of unequal sequence and length and displays unequal annotations for the two copies of gene ndhI, which is duplicated as part of the IR. Similarly, the plastid genome of Nymphaea cf. odorata appears to confuse the two transfer RNA genes for glycine and displays an incomplete annotation for the transfer RNA gene for cysteine (trnC-GCA), which is short by approximately 10 nucleotides compared to Amborella (Table 2). Moreover, this record has unexpected duplicates of the transfer RNA genes for methionine (trnM-CAT), threonine (trnT-GGT) and proline (trnP-GGG), which are nested within other transfer RNA genes. Similar inconsistencies exist in the genome annotations of other GenBank records analyzed here (Table 2). In addition to these sequence-specific annotation uncertainties, most GenBank records of the 15 plastid genomes analyzed here also lack annotations for the open reading frames orf42 and orf56, which are located in the intron of the transfer RNA for alanine (trnA-TGC). Moreover, the start and stop codon positions of several gene annotations in previously sequenced plastid genomes of early-diverging angiosperms appear to be incorrect (Table 2). For example, the following start or stop codon positions appear to be incorrect in Nymphaea cf. odorata compared to the annotations of the plastid genomes of all other early-diverging angiosperms analyzed, as the latter share the same start and stop codons for the genes in question: the stop codon of exon 1 of gene atpF appears to be 12 bp (i.e., 4 amino acids) too late; the start codon of exon 2 of gene petB appears to be 15 bp (i.e., 5 amino acids) too early; the start codon of exon 2 of gene petD appears to be 51 bp (i.e., 17 amino acids) too early; and the start codon of exon 2 of gene rpoC1 appears to be 9 bp (i.e., 3 amino acids) too early. Similar occurrences of potentially incorrect start and stop codon positions exist for the plastid genomes of Trithuria filamentosa and Trithuria inconspicua (Table 2).

Table 2 Overview of potential errors and instances for improvement in gene annotations of several previously published plastid genomes of early-diverging angiosperms

Gene complement of early-diverging angiosperms

The number of genes in the plastid genomes of early-diverging angiosperms appears to be highly conserved. All input taxa display a set of 116 unique genes in the plastid genome, not counting orf188 and the hypothetical protein-coding gene ycf68, given that it contains multiple stop codons (Online Resource 6). Of these unique genes, between 14 and 47 are duplicated in the IRs. Among this gene complement, 82 are protein-coding genes, 30 are transfer RNAs (tRNAs), and 4 are ribosomal RNAs (rRNAs). A detailed list of genes detected in the plastid genomes under study is given in Table 3, which likely represents the plastid-encoded gene complement of early-diverging angiosperms.

Table 3 Presumptive gene complement of early-diverging angiosperms

The presence of specific gene features in the plastid genomes of early-diverging angiosperms also appears to be highly conserved. A trans-spliced version of the gene rps12, for example, is found among all plastid genomes under study, whereby the gene displays three distinct exons, with the first exon located in the LSC, while the second and third exons are duplicated in the IR regions. We detected this exact configuration in all plastid genomes under study except for Trithuria filamentosa, where all three exons of rps12 are part of the IR and thus duplicated. Similar to the trans-spliced version of rps12, the presence of two open reading frames in the intron of the tRNA for alanine (trnA-TGC) is detected among all plastid genomes under study. This occurrence has been previously reported by Chumley et al. (2006), who termed these reading frames orf42 and orf56, respectively. Since trnA-TGC is located in the IR, both reading frames are duplicated in the plastome. Reading frame orf42 displays intact translation initiation and translation stop codons and is devoid of any internal stop codons (Online Resource 7). Reading frame orf56 displays one of two possible translation initiation codons, ends with a translation stop codon and exhibits no internal stop codons (Online Resource 8). All taxa under study exhibit the default start codon ATG for orf56, except for the plastomes of the genera Barclaya, Nymphaea and Victoria, which display the start codon GTG.

Presence of hypothetical genes ycf15 and ycf68

In addition to the regular set of genes, we also detected two hypothetical protein-coding genes in the plastid genomes under study which are not consistently annotated and described in angiosperm plastomes: ycf15 and ycf68. Hypothetical gene ycf15 is found in all plastid genomes under study except those of the order Austrobaileyales. All analyzed plastid genomes of the Nymphaeales contain the ycf15 motif. Compared to the ycf15 region of Nicotiana tabacum, where it was described for the first time, these copies contain an intervening sequence, possibly an intron, between the flanking regions that may be transcribed (Online Resource 5). The 5′ flanking region of ycf15 starts with a GTG start codon, although an alternative ATG start codon exists at position 52–54 of the alignment. All analyzed species contain stop codons earlier than Nicotiana. Only Amborella trichopoda, Nymphaea alba and Nymphaea cf. odorata are without a stop codon in the first exon and thus display a potential reading frame of 61, 81 and 80 amino acids, respectively, compared to 88 amino acids in Nicotiana, assuming that the intervening sequence represents an intron. Both Trithuria species have insertions and one deletion in the first exon, leading to reading frame shifts. Similarly, all species except for Amborella have gaps in the second exon, leading to reading frame shifts and diverging amino acid sequences. Hypothetical gene ycf68 could be detected in all plastid genomes under study. Most sequences display gaps that result in reading frame shifts and internal stop codons. An alternative start codon at position 189 of the alignment results in a relatively long reading frame of 306 bp (i.e., 102 amino acids) in Nymphaea alba, Nymphaea cf. odorata, Nuphar advena and Brasenia schreberi, which may be translated (Online Resource 6).

Phylogenetic analysis and data partitioning

Our phylogenetic reconstructions of the multi-gene alignment resulted in highly resolved phylogenetic trees, with almost all clades recovered having full branch support (Fig. 2). Monophyly of the families Cabombaceae and Hydatellaceae was fully supported in each inference (BS 100, PP 1.00), whereas monophyly of the family Nymphaeaceae was not. The position of the genus Nuphar in relation to the rest of the Nymphaeaceae and the Cabombaceae was not resolved. All partitioning schemes resulted in very weak support for the position of Nuphar, with three of them indicating a clade of Nuphar and the Cabombaceae (Online Resource 9): BS 58/PP 0.85 in the unpartitioned matrix, BS 65/PP 0.88 in the matrix of one partition per codon and BS 55/PP 0.97 in the matrix inferred as optimal via PartitionFinder2. The ML tree of the partitioned-by-gene matrix displayed Nuphar as sister to a clade formed by the Cabombaceae and the rest of the Nymphaeaceae, but without statistical support (BS 33, PP 0.22). The phylograms that resulted under the different data partitioning strategies displayed little divergence in the lineage between the stem lineages of Nuphar and the Cabombaceae (Online Resource 9). Furthermore, Victoria cruziana was found nested within the genus Nymphaea in each phylogenetic reconstruction and with high statistical support, indicating paraphyly of the genus Nymphaea in its current circumscription. All phylograms also indicated a long stem lineage of Trithuria compared to all other plastid genomes analyzed.

Fig. 2

Inferred phylogenetic relationships of 15 early-diverging angiosperm taxa and the potential phylogenetic position of the genus Nuphar. a Phylogenetic relationships of the 15 taxa under study as inferred from a multi-gene dataset of 77 plastid-encoded genes under ML phylogeny inference without data partitioning, visualized as cladogram (left) and as phylogram (right). Bootstrap support values of the ML inference greater than 50% and posterior probabilities (in italics) of a concomitant BI phylogeny inference greater than 0.5 are given above branches of the cladogram. b Three potential scenarios for the phylogenetic position of Nuphar within the Nymphaeales based on the results of our phylogenomic analyses that included data partitioning

Taxon designation of GenBank record NC_024542

To evaluate the taxon designation of GenBank record NC_024542 (Yang et al. 2014), we reconstructed the phylogenetic relationships between the 15 taxa under study, the 71 accessions of Nymphaea and the 16 accessions of related genera using trnT–trnF as phylogenetic marker. This reconstruction indicated that the plastid genome in question is unlikely to represent Nymphaea mexicana. The sequence in question is recovered as sister to a specimen of Nymphaea odorata subsp. tuberosa (BS 78, PP 0.99) and embedded in a clade comprised exclusively of specimens of N. odorata (Fig. 3). The statistical support for this clade is medium to high (BS 65, PP 0.98). Our reconstructions also infer a sister relationship between the species Nymphaea odorata and N. mexicana, which is supported by maximum branch support (BS 100, PP 1.00). The phylogeny further recovers a fully supported clade of Cabombaceae and Nymphaeaceae pro parte without Nuphar (BS 100, PP 1.00).

Fig. 3

Phylogenetic position of the 15 plastid genomes analyzed among a set of 71 accessions of Nymphaea and 16 accessions of related early-diverging angiosperms as inferred from a DNA sequence alignment of trnT–trnF. The displayed trees both constitute the tree with the highest likelihood score inferred, displayed as cladogram under (a) and as phylogram under (b). The sequences from the plastid genomes are highlighted in bold, and those from the genome with GenBank accession number NC_024542 (“Nymphaea mexicana,” Yang et al. 2014) are additionally highlighted in red. Bootstrap support values of the ML inference greater than 50% and posterior probabilities (in italics) of a concomitant BI phylogeny inference greater than 0.5 are given above branches in a. The branch from Amborella trichopoda to the rest of the species analyzed is truncated for better visualization in b

Discussion

Gene content and synteny in plastomes of early-diverging angiosperms

The plastid gene content of land plants is relatively stable, with gene losses in the angiosperms mainly associated with parasitic or mycoheterotrophic lifestyles (Logacheva et al. 2011; Wicke et al. 2011; Cusimano and Wicke 2016). All taxa included in this investigation are photoautotrophic, and their gene content is almost identical. Amborella and all species of the Nymphaeales analyzed exhibit 116 genes (not including ycf68 and orf188), whereas the two species of Austrobaileyales lack the hypothetical gene ycf15 and thus exhibit 115 genes (Table 1). Major structural changes have been described from several angiosperm lineages (Haberle et al. 2008; Cai et al. 2008; Weng et al. 2014), but such changes were not found in the plastid genomes of the Nymphaeales.

Generally, the sizes of plastid genomes can vary considerably, from 63 kbp in the parasitic Phelipanche to 242 kbp in the highly rearranged plastomes of Pelargonium (Weng et al. 2014). More than half of the 1983 plastid genomes of seed plants that are currently available on GenBank (GenBank search on February 11, 2017, for “chloroplast genomes” of Spermatophyta with genome size >60 kbp) display a genome size between 140 and 160 kbp. With sizes between 158,360 and 160,866 bp, the complete lengths of the plastid genomes of the Nymphaeaceae species analyzed are on the upper end of this spectrum. In the Cabombaceae, Brasenia schreberi has a plastid genome size similar to members of Nymphaeaceae, whereas Cabomba caroliniana displays a larger plastome (164,057 bp). The plastid genome of Trithuria inconspicua has an even larger size (165,389 bp), while the plastome of T. filamentosa (180,562 bp) exceeds the size of all but ten publicly available plastid genomes. By comparison, the plastomes of both Austrobaileyales under study were found to be considerably smaller (<150 kbp). While plastome size reductions are common in parasitic plants as a result of gene loss (Wicke et al. 2013), size increases are usually the result of expansions of the IR regions (Weng et al. 2014). Similarly, the differences in plastome size recorded in the present investigation are mainly the result of expansions or contractions of the IR regions. It is noteworthy that the plastid genomes of the two species of Trithuria under study display a length difference of >15 kbp, despite being likely sister species. According to Iles et al. (2014), the two species have diverged very recently (0.5 Ma [0–1.1 Ma]). Given this recent time of divergence, it is likely that the IR expansion between these two species originated via a single change in length rather than a gradual increase in IR size.
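The size comparison between the two Trithuria plastomes can be reproduced as a short worked calculation. The sketch below uses only the genome lengths stated above; the per-copy figure rests on the simplifying assumption that the entire length difference is due to the IR, which enters the total length twice (total = LSC + SSC + 2 × IR). The text only states that size differences are mainly the result of IR changes, so the per-copy value should be read as an upper-bound approximation.

```python
# Plastome lengths in bp, as stated in the text.
sizes_bp = {
    "Cabomba caroliniana": 164_057,
    "Trithuria inconspicua": 165_389,
    "Trithuria filamentosa": 180_562,
}

# Length difference between the two Trithuria plastomes.
trithuria_diff = (
    sizes_bp["Trithuria filamentosa"] - sizes_bp["Trithuria inconspicua"]
)

# Assumption (not a result of the study): if the whole difference were due
# to the IR, each of the two IR copies would account for half of it.
per_ir_copy = trithuria_diff / 2

print(f"Length difference: {trithuria_diff} bp")
print(f"Implied change per IR copy (assumption): {per_ir_copy} bp")
```

Under this assumption, a difference of about 15 kbp in total length corresponds to roughly 7.6 kbp per IR copy.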

The gene content and the gene synteny of the plastid genome sequences analyzed here are identical. Even though the topology of early-diverging angiosperm lineages is still disputed (Drew et al. 2014; Goremykin et al. 2015; Simmons 2016), the majority of studies support Amborella as sister to all other angiosperms, with Nymphaeales as the second and Austrobaileyales as the third diverging lineage (APW3, Stevens 2017). The conserved gene content and synteny in Amborella as well as the Nymphaeales lead to the hypothesis that the most recent common ancestor of both lineages, which may include all extant angiosperms, shared the same gene content and the same gene synteny, with the IRs spanning from rpl2 to ndhF. Our observed differences in IR length may have originated through length changes on distal branches, such as the IR expansions in the Hydatellaceae and the Cabombaceae or the IR contractions in the Austrobaileyales, but more research is needed to evaluate this hypothesis.

Correction of annotations and SSC orientation in Trithuria

The high degree of structural conservation across the plastid genomes of early-diverging angiosperms was initially obscured by ambiguous and probably incorrect annotations in several previously published plastid genomes (Table 2). Our manual corrections of the annotations of these previously published as well as our newly generated plastid genomes were instrumental in obtaining a more accurate assessment of gene content and synteny. However, not all ambiguities in gene annotations or genome structure among the previously published genomes are the result of human error. For example, the orientation of the SSC in Trithuria appears at first glance to be inverted compared to the plastid genomes of other early-diverging angiosperms, but the relative orientation of the SSC is not fixed in plastid genomes, and both orientations may occur in the same individual (Palmer 1983). Although this form of plastid heteroplasmy has been confirmed in a wide variety of plant species (see Walker et al. 2015 for review), several recent investigations have overlooked this fact and as a result pronounced the SSC a hotspot of inversions (Walker et al. 2015). Consequently, the different orientation of the SSC in Trithuria compared to the plastid genomes of other early-diverging angiosperms is unlikely to be the result of a genomic rearrangement and more likely reflects natural plastid heteroplasmy.

Open reading frames in early-diverging angiosperms

While the functions of most genes in the plastome are well understood (Wicke et al. 2011), several hypothetical reading frames of unknown function remain. Most plastid genomes contain a varying number of open reading frames (orf) that have different degrees of conservation. More conserved orf with similar amino acid content are called hypothetical chloroplast open reading frames and hypothetical protein-coding genes (ycf).

The presence of orf42 and orf56 in the plastid genomes of early-diverging angiosperms has been acknowledged by some but not all investigations that generated the previously published plastid genomes. We located the open reading frame orf42 in all 15 plastid genomes analyzed. Each of them displays intact translation initiation and translation stop codons and no internal stop codons (Online Resource 7). The open reading frame is located inside the intron of the tRNA for alanine (trnA-TGC) and, according to Chumley et al. (2006), has high sequence similarity with the 3′ end of the gene pvs-trnA, which in Phaseolus is found in the mitochondrial genome (Woloszynska et al. 2004). Similarly, the open reading frame orf56 is present in the intron of trnA-TGC in all plastid genomes under study, although it has so far been reported only from the plastomes of Nymphaea mexicana and Trithuria inconspicua. It does not start with the common translation initiation codon (ATG) in all taxa (Online Resource 8). Instead, orf56 appears to start with the codon GTG in the plastomes of Barclaya, Nymphaea and Victoria. The codon GTG constitutes one of two alternative translation initiation codons that have been reported from plastid genes of angiosperms (Kuroda et al. 2007; Raubeson et al. 2007). According to Chumley et al. (2006), orf56 displays high sequence similarity with the gene ACRS, which is found in the mitochondrial genome in Citrus (Ohtani et al. 2002). Both orf42 and orf56 are found in numerous angiosperm plastid genomes, including Pelargonium (Chumley et al. 2006) and Veratrum (Do et al. 2013), and could thus be part of the default gene complement of angiosperm plastomes. Goremykin et al. (2003b) described the orf56 in Calycanthus and also remarked on the similarity to the mitochondrial gene ACRS from Citrus. However, orf42 and orf56 do not seem to be translated to functional proteins throughout the angiosperms. In plastomes of the genus Utricularia, for example, orf42 and orf56 are reported as pseudogenes due to frameshifts and internal stop codons that are also absent from nuclear or mitochondrial genome contigs (Silva et al. 2016). We do not suggest that orf42 or orf56 necessarily constitute functional genes in the plastid genomes of early-diverging angiosperms and, at this point, have insufficient information to determine if their transcripts are translated to functional proteins. We do, however, wish to point out that the plastomes of all taxa evaluated in this investigation contain intact copies of both open reading frames, assuming that the codon GTG mediates successful translation initiation for orf56. It is noteworthy that those studies that have reported the presence of these hypothetical genes had taken additional steps toward improved gene detection and annotation refinement, such as the correction of gene annotations or the application of supplementary annotation software (e.g., Do et al. 2013; Silva et al. 2016).
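The criteria used above for calling an open reading frame intact (an accepted initiation codon, a terminal stop codon and no internal stop codons) can be expressed as a minimal check. The function name and the toy sequences are hypothetical and serve illustration only; whether GTG is accepted as an initiation codon is exposed as a parameter, because the conclusion for orf56 depends on that assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_intact_orf(seq, start_codons=("ATG", "GTG")):
    """Return True if `seq` begins with an accepted start codon, ends
    with a stop codon and contains no in-frame internal stop codon."""
    seq = seq.upper()
    # The sequence must consist of whole codons and hold at least
    # a start codon and a stop codon.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    has_start = codons[0] in start_codons
    has_stop = codons[-1] in STOP_CODONS
    internal_stop = any(c in STOP_CODONS for c in codons[1:-1])
    return has_start and has_stop and not internal_stop


# Toy sequences (hypothetical, not taken from Online Resources 7 and 8).
gtg_orf = "GTGGCTAAATGA"       # starts with GTG, ends with TGA
broken_orf = "ATGTAAGCTTGA"    # contains an internal TAA

intact_with_gtg = is_intact_orf(gtg_orf)
intact_atg_only = is_intact_orf(gtg_orf, start_codons=("ATG",))
intact_broken = is_intact_orf(broken_orf)
```

The same toy frame is classified as intact or not depending on whether GTG is accepted, which is the caveat attached to orf56 in the text.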

The origin of the hypothetical gene ycf15 has been addressed by various earlier investigations. It was first described as orf87 in Nicotiana (Shinozaki et al. 1986) and subsequently examined with regard to its functionality as a protein-coding gene by Schmitz-Linneweber et al. (2001), who found that ycf15 occurs in Spinacia oleracea, Arabidopsis thaliana, Zea mays and Oenothera berteriana. In these species, the hypothetical gene was found to contain an intervening sequence of 250–300 bp, which is absent in other plants like Nicotiana tabacum and was suspected to be an intron (Schmitz-Linneweber et al. 2001). The intervening sequence was hypothesized to be present in early-diverging angiosperms and to have been lost in several lineages throughout angiosperm diversification. Raubeson et al. (2007) compared the occurrence of ycf15 in a wider sample of angiosperms and showed that the intervening sequence is indeed present in several early angiosperms including Amborella, Nymphaea, Nuphar, most monocots and some eudicots, but that it was lost in all asterids investigated. The much deeper sampling of Nymphaeales in this study corroborates the conclusions by Schmitz-Linneweber et al. (2001) and Raubeson et al. (2007) for all Nymphaea species under study as well as the Cabombaceae and Hydatellaceae. The question of successful translation of ycf15 is less clear, however. The sequencing of cDNA by Schmitz-Linneweber et al. (2001) indicated a full transcription of the flanking regions of ycf15, but internal stop codons would likely interrupt the coding of polypeptides. Large portions of the open reading frame present in Nicotiana occur only in Amborella trichopoda, Nymphaea alba and N. cf. odorata, while stop codons and frame shifts interrupt the reading frame in all other investigated taxa. The fact that ycf15 is partially preserved might reflect functional relevance, for example as a promoter or terminator sequence in gene regulation.

The hypothetical gene ycf68 is found in all plastid genomes under study but consistently contains internal stop codons in many angiosperms (Raubeson et al. 2007). In Utricularia, for example, the gene ycf68 has a frameshift and one in-frame stop codon and may thus only be transcribed but not translated (Silva et al. 2016). Among the taxa studied here, ycf68 also displays multiple internal stop codons. Goremykin et al. (2004) proposed a later start for Nymphaea alba (at nucleotide position 189), which results in an amino acid sequence without internal stop codons. It is likely that the same applies to the ycf68 sequences of other Nymphaea species and of Nuphar. The idea of Goremykin et al. (2004) should thus be given greater consideration. The sequences of Barclaya, Victoria and both Trithuria species, however, would maintain internal stop codons even under this alternative start position due to indels and subsequent reading frame shifts compared to other Nymphaeaceae plastomes.

Phylogenetic position of Nuphar and potential paraphyly of the Nymphaeaceae

Based on a multi-gene alignment of 77 plastid-encoded genes extracted from the 15 plastid genomes under study, we conducted phylogenetic reconstructions under different data partitioning strategies. In each of these reconstructions, the families Cabombaceae and Hydatellaceae are supported as monophyletic, whereas the family Nymphaeaceae is not (Fig. 2). The phylogenetic position of the genus Nuphar remained unresolved. Under an unpartitioned matrix, for example, Nuphar is recovered as sister to the Cabombaceae, but the respective node is merely supported by 0.85 PP under BI and 58% BS under ML. The extremely short branch subtending the most recent common ancestor of Nuphar and the Cabombaceae in this matrix furthermore indicates a rapid diversification (Fig. 2, phylogram); other partitioning strategies produced very similar results (Online Resource 9). Three potential scenarios therefore remain for the phylogenetic position of Nuphar within the Nymphaeales: Nuphar as an early-diverging lineage of the Nymphaeaceae (Fig. 2b-i); Nuphar as sister to a clade formed by the Cabombaceae and the rest of Nymphaeaceae (Fig. 2b-ii); and Nuphar forming a clade with the Cabombaceae (Fig. 2b-iii). The latter two scenarios render the family Nymphaeaceae paraphyletic. Thus, phylogenetic analyses are needed that include a more extensive taxon sampling of Nuphar and the Cabombaceae at the species level and maybe the additional use of plastid spacers and introns.

Previous investigations had also been unable to consistently resolve the phylogenetic position of Nuphar. The monophyly of the Nymphaeaceae including Nuphar was supported by a combined dataset of nine plastid regions including introns, intergenic spacers and matK (Löhne et al. 2007). However, an earlier analysis of trnT–trnF sequences found only limited support for such a clade (Borsch et al. 2003), although it did not include DNA sequences of the Hydatellaceae. A study based on concatenated DNA sequences of matK and ITS2 available on GenBank, including accessions of the Hydatellaceae, found support for a clade comprising Nuphar and the Cabombaceae (Biswal et al. 2012). However, the results of Biswal et al. (2012) should be interpreted with care, as their taxon sampling is heavily asymmetric: apart from the Nymphaeales, none of the other early-branching angiosperms (Amborella, Austrobaileyales) were included in their dataset, and gymnosperms were used as outgroup.

From a morphological perspective, a scenario in which the Nymphaeaceae are monophyletic and sister to Cabombaceae currently remains the most plausible solution. A parsimony analysis of 66 morphological characters, including data on the Hydatellaceae, placed Nuphar as sister to the remainder of Nymphaeaceae (Borsch et al. 2008). Furthermore, eusyncarpous carpels were hypothesized to have arisen in the common ancestor of the Nymphaeaceae, while the Cabombaceae and the Hydatellaceae have apocarpous carpels (Borsch et al. 2008). Eusyncarpy also occurs in Illicium and the eudicots (Doyle and Endress 2000; Rudall et al. 2007), suggesting multiple gains of this feature in angiosperms. The alternative hypothesis of a sister relationship between Nuphar and the Cabombaceae (and also the third topology with Nuphar and Cabombaceae as successive sisters to the core Nymphaeaceae) would require a more complex explanation for the evolution of the gynoecium in the Nymphaeales. In the event of a sister relationship between Nuphar and the Cabombaceae, a less than parsimonious solution would also apply to pollen evolution, as the granular-intermediate infratectum is considered a synapomorphy of all Nymphaeaceae (Borsch et al. 2008).

As indicated by long branches of the presented phylogenies (Figs. 2, 3), the two representatives of the Hydatellaceae display numerous autapomorphic nucleotide changes compared to the other taxa under study. In order to determine if the recovery of Nuphar as the earliest-diverging genus of the Nymphaeaceae and Cabombaceae (Figs. 2, 3) represents an artifact of long-branch attraction to the Hydatellaceae, we repeated our phylogeny inference on the trnT–trnF sequence alignment after excluding the DNA sequences of the Hydatellaceae. The resulting phylogeny did not deviate from the initial one, indicating that trnT–trnF either contains a different phylogenetic signal than the multi-gene alignment of 77 plastid-encoded genes or a different level of homoplasy.
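The taxon-exclusion step described above can be sketched as a simple operation on an alignment before tree inference is repeated. The representation of the alignment as a dictionary, the sequence naming scheme (genus and species joined by an underscore), the function names and the toy sequences are all assumptions for illustration; the removal of gap-only columns is a common clean-up step that the text does not explicitly mention.

```python
def exclude_taxa(alignment, genera):
    """Return a copy of the alignment without sequences whose name
    starts with one of the given genus names."""
    return {
        name: seq
        for name, seq in alignment.items()
        if name.split("_")[0] not in genera
    }


def drop_gap_only_columns(alignment):
    """Remove alignment columns that consist solely of gap characters,
    which can appear once sequences have been excluded."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if any(char != "-" for char in col)]
    return {
        name: "".join(col[i] for col in kept)
        for i, name in enumerate(names)
    }


# Toy alignment (hypothetical sequences, not real trnT-trnF data).
toy_alignment = {
    "Amborella_trichopoda": "ACG-TA",
    "Nuphar_advena": "ACG-TT",
    "Trithuria_inconspicua": "ACGATA",
    "Trithuria_filamentosa": "ACGGTA",
}

reduced = drop_gap_only_columns(exclude_taxa(toy_alignment, {"Trithuria"}))
```

The reduced alignment would then be passed to the same ML and BI inference as the full dataset, so that any change in the position of Nuphar can be attributed to the excluded long-branch taxa.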

The results of our investigation support the hypothesis that the genus Nymphaea is paraphyletic in its current circumscription. A clade comprised of Barclaya, Nymphaea, Victoria and Euryale was supported by Borsch et al. (2007) and by Löhne et al. (2008). The shift in translation initiation codon of the open reading frame orf56 from ATG to GTG may be a molecular synapomorphy for this clade (Online Resource 8). While the first well-sampled molecular phylogenies of the Nymphaeales based on trnT–trnF (Borsch et al. 2007) inferred the Victoria-Euryale clade as sister to a weakly supported monophyletic genus Nymphaea (including Ondinea, see Löhne et al. 2009), the addition of further plastid sequence data (Löhne et al. 2007) recovered it as sister to a clade comprised of Nymphaea subg. Hydrocallis and Nymphaea subg. Lotus, although with only moderate branch support. The phylogenetic trees inferred from our alignment of 77 plastid-encoded genes strongly support the placement of Victoria within Nymphaea and also support the first branching position of the temperate subclade of Nymphaea (Fig. 2). Thus, it appears that genome-scale data hold great promise to further illuminate the relationships and evolutionary diversification of the water lilies.

Taxon designation of GenBank record NC_024542

Doubts about the correct identification of a previously published plastid genome of a species of Nymphaea arose upon the comparison of standard phylogenetic DNA markers across this and other Nymphaea samples. The GenBank record in question was published by Yang et al. (2014) as part of an investigation on primer development for the long-range PCR amplification of plastid genomes and was designated as “Nymphaea mexicana.” Our phylogenetic tree based on trnT–trnF recovered this sequence as nested within the North American clade of Nymphaea odorata and as sister to a specimen from Vermont (USA), which belongs to Nymphaea odorata subsp. tuberosa. The trnT–trnF sequence of NC_024542 is completely identical to the sequence of the Vermont specimen, even in the hypervariable parts of the P8 stem loop (data not shown), which were excluded from the alignment. Unfortunately, we were unable to view and assess the herbarium record associated with NC_024542, as the herbarium voucher could not be located by the original institution (E. Liu, pers. comm.). Although both species (N. mexicana and N. odorata) are monophyletic (Borsch et al. 2014), there are hybrids between them, and plants of Nymphaea odorata have occasionally been crossed into other species to breed ornamental specimens. Some of these ornamental plants have a yellowish flower color, and it seems possible that such an ornamental individual may have been used for generating the plastid genome NC_024542. The uncertainty around the taxonomic identity of this plastid genome illustrates that the careful identification of plant material and the generation of publicly available herbarium specimens for taxonomic re-evaluation remain important tasks in the process of phylogenomic analysis.

Conclusion

The plastid genomes of early-diverging angiosperms were found to be highly conserved with regard to gene content and gene synteny. The full degree of conservation did, however, only become apparent after the manual correction of the annotations of several previously published plastid genomes under study, underscoring the need for the re-assessment of genome annotations that have not undergone stringent manual curation. Our phylogenetic reconstructions revealed a potentially paraphyletic family Nymphaeaceae and a paraphyletic genus Nymphaea, which stand in contrast to the results of prior molecular and morphological studies and indicate the need for further investigation. In particular, the phylogenetic position of the genus Nuphar requires additional assessment. Methodologically, our results indicate that plastid genomics in the Nymphaeales offers great potential for further insights into the evolutionary diversification of the water lily clade. In particular, the fully conserved gene order has the potential to provide a study case for genome-level alignments that include non-coding regions.