Introduction

Although the plastid genomes of diatoms occupy a relatively narrow part of the range of size, gene content, and architecture seen across organelle genomes (Smith and Keeling 2015), they nevertheless exhibit substantial variation among species in gene and intergenic sequence content (Brembu et al. 2014; Ruck et al. 2014). This variation reflects numerous underlying processes that continue to drive genomic divergence among species. These processes include: (1) gene duplication, pseudogenization, and loss (Ruck et al. 2014); (2) ongoing intracellular transfer of genes to the nucleus (Lommer et al. 2010; Sabir et al. 2014); and (3) acquisition of foreign DNA from sources inside, and perhaps even outside, the cell (Brembu et al. 2014; Ruck et al. 2014). Altogether, these processes appear to have affected roughly 10 % of the total protein-coding gene set and have occurred against the backdrop of a genome that, despite pervasive rearrangements, maintains the prototypical circular-mapping, tripartite plastid genome architecture and a conserved, AT-rich nucleotide composition (Ruck et al. 2014).

Our ability to discern even broad-scale patterns of evolutionary change in the organelle and nuclear genomes of diatoms is, however, limited by the scarcity of genomic data for this group. Among the tens of thousands of extant diatom species (Mann and Vanormelingen 2013), just a few dozen have fully sequenced plastid genomes. Even fewer species have fully sequenced mitochondrial genomes, and nuclear genomes have been sequenced for only a small number of mostly model species (Armbrust et al. 2004; Bowler et al. 2008; Lommer et al. 2012). Sparse or biased taxonomic sampling can mislead biological interpretations of genomic data. For example, as more genomes are sampled, improved models can lead to the discovery of genes that were once thought missing (Simola et al. 2013), and genes that were once thought to have a foreign origin can be revealed as native (Salzberg et al. 2001). Dense sampling is an absolute prerequisite to understand the evolution of gene families with especially complex histories, such as the algal por gene, which has been serially duplicated, lost, and passed around among sometimes distant relatives by horizontal gene transfer (Hunsperger et al. 2015).

We sequenced the plastid genome of the tropical marine diatom, Toxarium undulatum Bailey, providing a first glimpse into plastid genome evolution in this part of the diatom phylogenetic tree. The presence of a complete set of genes encoding the light-independent protochlorophyllide oxidoreductase (LIPOR) pathway, previously thought missing in diatoms, expands our understanding of the functional photosynthetic repertoire of the ancestral diatom and further highlights the paradoxical retention in some taxa of a biochemical pathway that is clearly expendable. The genome also contains a putatively foreign group II intron not previously known from diatoms. Among sequenced diatom plastid genomes, these two features are found only in T. undulatum, but we show that these parallel patterns likely resulted from very different underlying processes.

Materials and methods

Diatom culturing, DNA extraction, and sequencing

A clonal culture of Toxarium undulatum (strain ECT3802) was established from a field collection of epiphytes associated with Gab Gab Reef, Apra Harbor, Guam, USA (13.44°N, 144.64°E) in October 2008. The culture was maintained in L1 medium (Guillard 1975) at 22 °C on a 12:12 h light:dark cycle. Live cells were concentrated by centrifugation, frozen, and then disrupted by mixing with glass beads using a Mini-Beadbeater-24 (BioSpec Products). Genomic DNA was extracted with a Qiagen DNeasy DNA Plant Mini Kit and sequenced on the Illumina MiSeq platform at the Institute for Genomics and Systems Biology at Argonne National Laboratory, generating 150-nt paired-end reads from libraries 300 nt in length.

Genome assembly and analysis

We assembled sequencing reads with Ray ver. 2.2.0 using default settings and a k-mer length of 31 (Boisvert et al. 2010). We then used Geneious ver. 5.4 (Biomatters Ltd., Auckland, New Zealand) to identify gaps and finish the assembly. We annotated protein-coding genes with DOGMA (Wyman et al. 2004) and used ARAGORN (Laslett and Canback 2004) to identify predicted tRNAs and tmRNAs. We checked the boundaries of rRNA and ffs genes by comparisons to previously sequenced diatom genomes using the National Center for Biotechnology Information’s (NCBI) BLASTN software. The annotated genome sequence can be downloaded from GenBank using accession number KX619437.

To reconstruct the phylogenetic history of chl genes in T. undulatum, we collected chl gene sequences for representative heterokonts, red algae, green algae, and cyanobacteria from NCBI’s GenBank sequence repository. We then aligned conceptual amino acid translations with MAFFT (Katoh and Standley 2013) using default settings. In preliminary tree searches with the nucleotide alignments, we found that removal of visually poorly aligned regions did not affect the phylogenetic placement of T. undulatum, so all analyses used the full-length alignments. The individual chlB, chlL, and chlN gene trees were congruent with respect to the placement of T. undulatum (Fig. S1), so the three genes were concatenated to produce a single and more strongly supported phylogenetic hypothesis. Using IQtree v.1.4.1 (Nguyen et al. 2015), we identified cpREV + R5 (Adachi et al. 2000) as the best substitution model for the concatenated amino acid alignment. In this model, R represents the number of rate categories in the FreeRates model for among-site rate variation (Yang 1995). We used IQtree to perform 25 maximum likelihood optimizations with default settings, choosing the tree with the highest likelihood as the best one. Support for inferred relationships was obtained using bootstrap analysis with 500 pseudoreplicates. Multiple sequence alignments were deposited in an online data repository hosted by Zenodo (doi:10.5281/zenodo.58229).
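The concatenation step described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in this study, which is not described at that level of detail. The function name, the toy taxon labels, and the toy sequences are all hypothetical, and the choice to pad a taxon missing from one gene with gap characters is an assumption made for the illustration.

```python
def concatenate_alignments(alignments, gap="-"):
    """Concatenate per-gene alignments into one supermatrix.

    Each alignment is a dict mapping taxon name -> aligned sequence.
    A taxon absent from a gene is padded with gap characters so that
    every row of the result has the same length (an assumption here).
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, gap * width)
    return supermatrix


# Toy data only; these are not real chl sequences.
chlB = {"Toxarium": "MKLA", "Triparma": "MKIA", "Cyano": "MRLA"}
chlL = {"Toxarium": "GKG", "Triparma": "GKG"}
chlN = {"Toxarium": "ST-V", "Triparma": "STAV", "Cyano": "SSAV"}

matrix = concatenate_alignments([chlB, chlL, chlN])
for taxon, row in matrix.items():
    print(taxon, row)
```

Concatenation of this kind is only appropriate when the individual gene trees agree, which is why congruence of the chlB, chlL, and chlN trees was checked first.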

Results and discussion

General features

The plastid genome of T. undulatum mapped as a single, circular molecule of length 141,681 nt (Table 1; Fig. 1), though it is unclear whether this topology exists in vivo (Bendich 2004). The genome has the tripartite architecture conserved across diatoms and many other plastid-bearing lineages (Smith and Keeling 2015), with small and large single copy regions separated by a pair of large inverted repeats (Fig. 1). Similar to the plastid genomes of most other eukaryotes, including diatoms (Ruck et al. 2014; Smith 2012), the plastid genome of T. undulatum is AT-rich (Table 1). Overall, the plastid genome of T. undulatum contains the vast majority of the open reading frames, protein-coding genes, and rRNA and tRNA genes that are near-universally conserved across diatoms. Several less conserved, mostly accessory genes are missing from the genome, including acpP, bas1, ilvB, ilvH, and petJ. Absence of these genes reflects either deep losses within diatoms (ilvB, ilvH, and petJ) or more recent losses (acpP and bas1) (Ruck et al. 2014; Sabir et al. 2014). Pinpointing the exact timing of these losses will require more genomic sampling across diatoms.
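For readers who wish to check nucleotide composition themselves, the following Python sketch shows one way to compute overall and windowed G + C content from a sequence string. It is an illustration only and is not how OGDRAW computes its G + C track. The function names, window size, and step are hypothetical choices, and the sketch treats the sequence as linear, so windows do not wrap around the origin of the circular map.

```python
def gc_fraction(seq):
    """Fraction of G + C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window, step):
    """G + C fraction in successive windows along a linear sequence."""
    return [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence, not taken from the T. undulatum genome.
toy = "AAAAGGGG"
print(gc_fraction(toy))
print(windowed_gc(toy, window=4, step=4))
```

An AT-rich genome is one for which `gc_fraction` falls well below 0.5.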

Table 1 General features of the plastid genome of the diatom, Toxarium undulatum
Fig. 1 Annotated map of the plastid genome of the diatom, Toxarium undulatum. Genes drawn on the inside of the circle are transcribed in the clockwise direction, whereas those on the outside of the circle are transcribed in the counterclockwise direction. The interior gray bar plot shows the average G + C content. The genome map was rendered with OGDRAW (Lohse et al. 2007)

Light-independent protochlorophyllide oxidoreductase genes

Oxygenic photosynthesis in diatoms relies on chlorophyll a and the accessory pigment, chlorophyll c (Round et al. 1990). The last steps of chlorophyll a biosynthesis involve reduction of protochlorophyllide a to chlorophyllide a, the direct precursor molecule to chlorophyll a (Armstrong 1998). Although this conversion can be carried out through either light-dependent (POR) or light-independent (LIPOR) pathways, LIPOR genes have been lost repeatedly within virtually all major lineages of photosynthetic eukaryotes. The genes are missing from angiosperms and some gymnosperms (Ueda et al. 2014), Euglenophytes and many green algae, haptophytes, some cryptophytes and rhodophytes, and numerous algal heterokonts (Fong and Archibald 2008; Hunsperger et al. 2015). The presence of LIPOR genes in the plastid genomes of some heterokonts, including Triparma laevis—the sister lineage to diatoms—and their absence from the entire set of sequenced diatom plastid and nuclear genomes, suggested that these genes were lost in diatoms following their split from Parmales (Hunsperger et al. 2015; Tajima et al. 2016).

The plastid genome of T. undulatum contains three intact genes (chlB, chlL, and chlN) that together encode the complete LIPOR pathway (Fig. 1). This is the first report of these genes from among the few dozen completely sequenced diatom nuclear and plastid genomes. Given this highly restricted distribution, we used phylogenetic analysis to test two competing hypotheses that could account for their exclusive presence in T. undulatum: (1) the LIPOR genes were transferred into T. undulatum from a foreign donor, or (2) the genes were present in the plastid genome of the ancestral diatom and differentially retained in T. undulatum. We compiled LIPOR genes from a phylogenetically broad set of photosynthetic organisms, including cyanobacteria, land plants, and a diverse set of algae in the “red” plastid lineage (Fig. 2). The dataset included LIPOR genes from several algal heterokonts, including Tr. laevis, the sister lineage to diatoms. Phylogenetic analyses placed each of the three genes individually within heterokonts and sister to Tr. laevis (Fig. S1). A concatenated tree provided strong support for this relationship (Fig. 2), ruling out horizontal transfer as the source of these genes in T. undulatum and instead indicating that LIPOR genes were ancestrally present in diatoms. This hypothesis is also supported by conserved synteny. Although the T. undulatum and Tr. laevis genomes are highly rearranged compared to one another, the structure and arrangement of the LIPOR genes are similar in the two genomes. The genes are located in single-copy regions of the genome in both species, with a syntenic and identically oriented chlL–chlN–rps6 gene cluster distantly separated from a psaF–chlB cluster (Fig. 1).

Fig. 2 Phylogenetic tree of amino acid sequences from three concatenated LIPOR genes (chlB, chlL, and chlN) present in the plastid genome of the diatom, Toxarium undulatum. This is the best of 25 optimizations using IQtree with default settings and a cpREV + R5 model of amino acid replacement. Numbers at nodes are standard bootstrap proportions from 500 pseudoreplicates

Repeated gene loss is a familiar theme in diatom plastid genomes, with some genes lost six or more separate times over the course of diatom evolution (Ruck et al. 2014). Another common theme is that many of the same genes repeatedly lost in diatoms are dispensable in other algal groups as well (e.g., bas1; Sánchez-Puerta et al. 2005). Based on our current understanding of phylogenetic relationships within diatoms (Theriot et al. 2015) and what we know from the relatively small sample of diatom plastid genomes, we estimate that LIPOR genes were lost at least five separate times across diatoms while being maintained, fully intact and presumably functional, within T. undulatum (Fig. 3). Further sampling will show whether these genes are present throughout the genus Toxarium and in related genera, including Ardissonea, Climacosphenia, and allies (Medlin et al. 2008).
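The logic behind a minimum-loss estimate of this kind can be sketched in a few lines of Python. The sketch is illustrative and does not reproduce the count reported here, which was made by inspection of the published phylogeny. It assumes that the trait was present in the common ancestor and can be lost but never regained (a Dollo-style assumption). The function names, the nested-tuple tree encoding, and the tip labels are hypothetical and do not represent the actual diatom tree.

```python
def _count(tree, has_trait):
    """Return (losses within clade, whether any tip retains the trait)."""
    if isinstance(tree, str):  # a tip label
        return 0, tree in has_trait
    results = [_count(child, has_trait) for child in tree]
    if not any(present for _, present in results):
        # Whole clade lacks the trait: a single loss on the branch
        # leading to it, which is counted by the parent node.
        return 0, False
    losses = sum(n for n, _ in results)
    losses += sum(1 for _, present in results if not present)
    return losses, True


def min_losses(tree, has_trait):
    """Minimum losses, assuming ancestral presence and no regain."""
    losses, present = _count(tree, has_trait)
    return losses if present else 1


# Hypothetical six-tip tree; only tip "T" retains the trait.
toy_tree = ((("A", "B"), "T"), ("C", ("D", "E")))
print(min_losses(toy_tree, {"T"}))  # 2
```

In the toy tree, the clade (A, B) and the clade (C, (D, E)) each lack the trait entirely, so two independent losses suffice. Denser taxon sampling can raise or lower such a count, which is why the estimate in the text is stated as a minimum.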

Fig. 3 Different processes account for the unique presence, among diatoms, of LIPOR (chl) genes and an intron within the psaA gene in the plastid genome of Toxarium undulatum. The chl genes were lost at least five times (“–CHL” branch annotation) following the split of diatoms from Parmales, whereas the intron appears to have been recently acquired, presumably from a foreign donor. Phylogenetic relationships are based on Theriot et al. (2015), and genomic comparisons were based on plastid genomes from Ruck et al. (2014), Sabir et al. (2014), and Tajima et al. (2016)

Several compelling hypotheses have been proposed to account for the preferential reliance on the POR pathway for enzymatic reduction of protochlorophyllide in most taxa. The LIPOR complex likely originated in anoxygenic bacteria and at a time when atmospheric oxygen levels were much lower than they are today (Raymond et al. 2004; Reinbothe et al. 1996). The LIPOR protein complex contains an oxygen-sensitive iron sulfur cluster that exhibits reduced functionality in oxygen-rich conditions associated with high light or long photoperiods (Ueda et al. 2014; Yamazaki et al. 2006). The POR pathway, by contrast, evolved in the more photosynthetically active cyanobacteria and is insensitive to oxygen. The LIPOR pathway is, as a result, essentially incompatible with the high levels of oxygenic photosynthesis that characterize modern photosynthetic eukaryotes. In addition, many diatoms live in offshore regions of the ocean where iron is growth-limiting (Boyd et al. 2007). Several offshore species show specific adaptations to these environments, either through reduced iron demands (Peers and Price 2006; Strzepek and Harrison 2004) or luxury uptake and storage of iron (Marchetti et al. 2009). Repeated loss of the LIPOR pathway across diatoms may, therefore, reflect another type of adaptation to low iron in ancestral diatom taxa (Hunsperger et al. 2015).

The more vexing question is, however, why LIPOR genes are maintained alongside POR genes in some lineages (Fong and Archibald 2008; Hunsperger et al. 2015). Although they may confer some advantage to taxa regularly exposed to low light or with a frequent dependency on heterotrophic growth, neither of these seems to be particularly applicable to T. undulatum—a neritic, benthic diatom capable of active motility (Hasle and Syvertsen 1997; Kooistra et al. 2003). The three LIPOR genes in T. undulatum are highly conserved in sequence, with T. undulatum chlB and chlN showing >76 % amino acid identity with Tr. laevis, and chlL showing fully 92 % identity between T. undulatum and Tr. laevis—two taxa that split more than 150 Mya (Sorhannus 2007). In short, these genes appear to have been conserved and maintained in the T. undulatum lineage since the origin of diatoms and are by no means in the process of being lost, as LIPOR genes are in some lineages (Fong and Archibald 2008). Additional genomic sampling will show how far back in Toxariales these genes have been conserved, whether any other diatom lineages have also retained LIPOR genes, and whether retention or losses have ecological or other genomic correlates (e.g., the coincidental duplication of por genes; Hunsperger et al. 2015).
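Percent identity values such as those above depend on how alignment gaps are treated, and the convention used for the values reported here is not stated. The Python sketch below shows one common convention, in which columns containing a gap in either sequence are excluded from the comparison. The function name and the toy sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two aligned sequences.

    Columns with a gap in either sequence are skipped (an assumed
    convention; other definitions use a different denominator).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned peptides, not real ChlL sequences.
print(percent_identity("MKLAV", "MKIAV"))  # 80.0
```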

Novel group II intron in psaA

Among plastid genomes, those of diatoms are somewhat unusual in their propensity to collect and maintain noncoding DNA sequences, including deteriorating pseudogenes, plasmid-derived genes, and ORFs or other sequences of unknown origin (Ruck et al. 2014). Although largely unrecognizable, the restricted phylogenetic distributions of sequences in the latter category suggest that they may be recent acquisitions, either from the cell’s own (unsequenced) nuclear genome or some foreign source (Ruck et al. 2014). At least a few of the most recognizable “extra” sequences were, indeed, very likely acquired from an outside donor. For example, among the few dozen diatom species with a sequenced plastid genome, just one of them, Seminavis robusta, contains introns (Brembu et al. 2014). One of these is a group I intron within the large subunit rRNA (rnl) gene. The intron contains an intron-encoded protein (IEP) with a LAGLIDADG-type homing endonuclease, which is known to have facilitated the horizontal spread of similar introns among organelle genomes in other lineages (Belfort and Perlman 1995; Sánchez-Puerta et al. 2008; Turmel et al. 1995). The plastid genome of S. robusta also contains a group II intron located in the atpB gene (Brembu et al. 2014). This intron contains an IEP encoding a reverse transcriptase—a common feature of introns located in rRNA genes that also facilitates intron mobility and horizontal transfer (Johansen et al. 2007). Both introns in Seminavis are most similar to ones found in green algae, which led to the conclusion that they were acquired by horizontal transfer from green algal donors (Brembu et al. 2014). Given the available data, however, it cannot be ruled out that the transfer occurred in the other direction, from diatoms to chlorophytes, or that these two lineages never directly exchanged the intron at all (see discussion below). 
Introns with aberrant, highly restricted phylogenetic distributions have been found in other algal plastid genomes, too—for example, the nested group II/III “twintrons” in some cryptophytes that either were inherited vertically (Perrineau et al. 2015) or were acquired from a euglenoid-like donor (Khan and Archibald 2008).

The plastid psaA gene in T. undulatum is interrupted by a large (2844 nt in length) group II intron (Fig. 1), the first report of this intron in diatom plastid genomes. The intron is of type IIB, with the six predicted domains common to group II introns and the conserved 5′ (GUGYG) and 3′ (AY) end sequences (Zimmerly and Semper 2015). Presence of the catalytic AGC triad in domain V and a bulging A motif in domain VI, which are critical for splicing, together suggest that the intron is intact and autocatalytic (Gordon and Piccirilli 2001; Zimmerly and Semper 2015). Domain IV of the intron contains an IEP that is 507 amino acids in length (Fig. 1). A BLASTP search of the IEP against NCBI’s Conserved Domain Database (Marchler-Bauer et al. 2015) revealed the characteristic reverse transcriptase (RT), maturase (X), and DNA-binding (D) domains, but the IEP lacks the H–N–H endonuclease domain found in many IEPs (Blocker et al. 2005). The reverse transcriptase domain includes all 13 putative active sites, all 8 putative dNTP binding sites, and the nucleic acid binding site (Qu et al. 2016). Finally, the psaA gene itself is fully intact, with no signs of degradation or co-conversion of the flanking exon sequences (Belfort and Perlman 1995; Lambowitz and Belfort 1993), so available evidence suggests that the intron has not disrupted the functionality of this important photosystem gene.
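The boundary consensus mentioned above lends itself to a simple check. The Python sketch below tests whether a candidate intron sequence begins with GUGYG and ends with AY, where Y is a pyrimidine (C or U). It is a minimal illustration, not the annotation procedure used in this study. The function name and the toy sequences are hypothetical, and a match on these two motifs alone does not establish that a sequence is a group II intron.

```python
import re


def has_group_ii_boundaries(intron):
    """True if the sequence starts with GUGYG and ends with AY.

    Accepts DNA or RNA; T is converted to U before matching.
    Y (pyrimidine) is expanded to the character class [CU].
    """
    rna = intron.upper().replace("T", "U")
    starts_ok = re.match(r"GUG[CU]G", rna) is not None
    ends_ok = re.search(r"A[CU]$", rna) is not None
    return starts_ok and ends_ok


# Toy sequences only; not the T. undulatum psaA intron.
print(has_group_ii_boundaries("GTGCGaaaaAT"))  # True
print(has_group_ii_boundaries("GTGAGccccAT"))  # False
```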

To better understand the origin of this intron in T. undulatum, we used NCBI-BLASTN to search the entire 2.8 kb intron sequence against NCBI’s non-redundant nucleotide sequence database. The only strong matches were to the IEP, so we used NCBI-BLASTX to search the IEP against NCBI’s non-redundant protein sequence database. The top 10 hits matched IEPs within group II introns in plastid genes of green algae in the division Chlorophyta, with nine of them matching Chlorophyceae (e.g., the Chlamydomonas and Volvox lineage). Five of the top 10 hits matched IEPs located in the plastid psaA genes of chlorophycean green algae. The top matching green algal IEPs aligned along nearly the entire length of the gene but shared just 40–50 % amino acid identity, with very little matching sequence outside the IEP. The low similarity of these matches, combined with the relatively small number of matches outside of green algae, indicated that phylogenetic analysis would offer no additional information about the ancestry of this intron.

Although currently available data preclude identification of the putative donor, the highly mobile nature of group II introns (Belfort and Perlman 1995; Lambowitz and Belfort 1993), combined with the sporadic distribution of this particular intron both within diatoms and across algae, is consistent with its introduction into T. undulatum or an earlier ancestor by horizontal transfer. Although the intron is most similar to group II introns within psaA genes of chlorophycean green algae, these algae are unlikely to be the proximal donor of the intron to T. undulatum. Phylogenetic studies of green algae have relied heavily on plastid genome data (Lemieux et al. 2014, 2015), so green algae are disproportionately represented in algal plastid genome databases. As a result, although the diatom and green algal psaA introns almost certainly have some shared history, the apparent close relationship found here and elsewhere (Brembu et al. 2014) may simply be a sampling artifact. Finally, the best horizontal transfer hypotheses have both strong phylogenetic support and either a plausible hypothesis for the transfer mechanism (Rice et al. 2013) or evidence of an intimate relationship between exchanging species (Mower et al. 2010; Sloan et al. 2014). To the best of our knowledge, no such relationship exists between diatoms and green algae. Several studies (Deschamps and Moreira 2012; Woehle et al. 2011) have cast serious doubt on the hypothesis that diatoms temporarily harbored an ancient “green” endosymbiont (Moustafa et al. 2009).

Additional sampling of plastid genomes is necessary to reconstruct the phylogenetic history of this intron, including what other algal lineages may harbor them, how recently they all diverged, and the precise pattern and directions of exchange among species. There is a growing appreciation for the intimate relationships between diatoms and both bacteria (Amin et al. 2012) and viruses (Tomaru and Nagasaki 2011), so it may be that intron transfers have been mediated by shared bacterial, viral, or plasmid vectors.

In summary, although currently available data suggest that, within diatoms, the psaA intron in T. undulatum is a recent arrival from a foreign donor, additional plastid genome data will provide the ultimate test of this hypothesis. If upheld, these data may shed valuable light on the mechanism of transfer, highlighting previously unknown associations of diatoms with other organisms. The possibility exists that further sampling may reveal the presence of this intron in other diatoms or heterokonts, strengthening support for vertical inheritance and widespread loss, similar to building evidence for the ancestral presence of twintrons in cryptophyte plastid genomes (Khan and Archibald 2008; Perrineau et al. 2015). Regardless, data from this study further underscore the dynamic nature of diatom plastid genomes. Ongoing work to sequence plastid genomes from a more phylogenetically diverse sample of diatoms will allow us to reconstruct the full pan-genome of diatom plastids and tease apart fine-scale patterns of gains and losses of aberrantly distributed genomic features.