1 Nuclear-Encoded Introns

Many nuclear-encoded protein-coding genes in Euglena contain introns whose properties vary enough to place them in at least two distinct categories: conventional spliceosomal introns, which are predicted to be removed from precursor mRNAs by the characterized Euglena spliceosome components, and so-called “non-conventional” (non-canonical) introns, which are excised by unknown cellular components. From the limited set of Euglena genes whose sequences have been determined and compared to their expressed mature mRNA sequences, it appears relatively common for an individual gene to have multiple introns and to possess both intron types.

The non-conventional introns are defined as containing extensive secondary structural potential via base-pairing of intron 5′ and 3′ end proximal sequences, but little overall intron sequence conservation (Tessier et al. 1991; Canaday et al. 2001; Russell et al. 2005; Milanowski et al. 2014, 2016; Muchhal and Schwartzbach 1992, 1994) (Fig. 8.1b). They also frequently contain direct repeat sequences, of variable length, at the intron termini creating uncertainty in the accurate prediction of splice donor and acceptor sites for some of these introns. The lack of strict conservation of the direct repeats and their sequence variability indicates that they are unlikely to have a role in the splicing mechanism but may instead be remnants of intron sequence insertion and mobility events. Milanowski et al. have noted that these features are reminiscent of MITE-like transposon elements (Milanowski et al. 2014); therefore, if the non-conventional introns have been derived from such elements then perhaps trans-acting factors that associate with them may have been co-opted to be involved in the splicing mechanism. While there is no apparent conservation of extended sequence elements in these introns, intron sequence comparisons have revealed a preference for intron 5′ end proximal nucleotide positions +4 to +6 to be ‘CAG’ and the complementary sequence (CTG) starting 6 nucleotides from the intron 3′ end (Milanowski et al. 2014, 2016) (Fig. 8.1b). The conservation of these short sequences and their ability to base pair may be required for the splicing mechanism and accurate determination of splice site boundaries.
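The short complementary motifs described above can be checked with a few lines of code. The following is a minimal sketch, not a published intron-prediction method: the function names are hypothetical, the intron is assumed to be given as a DNA string from its first to its last nucleotide, and "starting 6 nucleotides from the intron 3′ end" is interpreted here as the three nucleotides occupying positions −6 to −4.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def has_terminal_pairing_motif(intron: str) -> bool:
    """Check for CAG at intron positions +4 to +6 and a base-pairing
    partner (CTG) at positions -6 to -4.

    The 3' window is an assumption about how to read the description
    in the text; adjust the slice if a different convention is intended.
    """
    intron = intron.upper()
    if len(intron) < 12:  # too short to hold both windows without overlap
        return False
    five_prime = intron[3:6]      # positions +4 to +6 (1-based)
    three_prime = intron[-6:-3]   # positions -6 to -4 from the 3' end
    return five_prime == "CAG" and three_prime == reverse_complement(five_prime)


# Toy sequence invented for illustration; not a real Euglena intron.
toy_intron = "GTT" + "CAG" + "AAAAAAAA" + "CTG" + "TAG"
print(has_terminal_pairing_motif(toy_intron))
```

Because the text describes these motifs as a preference rather than a strict requirement, a failed check does not rule out a non-conventional intron.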

Fig. 8.1 Structural features of the different E. gracilis intron classes. (a) Canonical structure for a U2-type spliceosomal intron with intronic 5′ splice site boundary nucleotides in green and 3′ nucleotides in red, and the branch point A in black. (b) Secondary structure of an E. gracilis non-conventional intron. For those introns lacking direct repeats, nucleotides that sometimes show adherence to 5′ and 3′ splice site boundaries of conventional introns are in green and red respectively. Conserved +4 to +6 nucleotides and base paired nucleotides are shown in black. There are variable numbers of nucleotides in the base-paired stems and the dashed line represents the variable length intronic region. (c) General secondary structure for a chloroplast twintron arrangement, in this case composed of a group II intron inserted within a group III intron. Conserved structural domains for both introns are labelled. Position of insertion of the group II intron in domain VI of the group III intron is indicated by the small box. Relative positions of branch point A nucleotides for both introns are indicated.

Some of the Euglena non-conventional introns contain intron terminal nucleotides (5′GT or AG3′) identical to those of most conventional spliceosomal introns (Canaday et al. 2001; Milanowski et al. 2014) (Fig. 8.1a). Additionally, some of the predicted spliceosomal introns show extended base-pairing potential in intron locations similar to regions of secondary structure observed in the non-conventional introns. Such observations have raised questions about whether interconversion between intron classes may occur during intron sequence evolution in Euglena and whether introns demonstrating mixed features of both classes should be classified as a distinct type called “intermediate” introns (Canaday et al. 2001; Russell et al. 2005). These could potentially be excised using components of both the spliceosome and trans-acting non-conventional intron splicing factors.

Milanowski et al. have recently examined the conservation of intron position and class in conserved nuclear genes in different Euglena species to shed light on such questions (Milanowski et al. 2014, 2016). These studies have further refined the limited conserved sequence and structural features of non-conventional introns (as described above) and revealed that non-conventional intron gain/loss appears to occur much more frequently than observed for euglenid spliceosomal introns. There is also much greater intron length variation in different species at conserved non-conventional intron positions than is the case for the conserved spliceosomal introns. A preference for a 5′ purine nucleotide in non-conventional introns has also been observed (Milanowski et al. 2016) that perhaps affects splicing efficiency, thus explaining the frequent observation of non-conventional introns starting with the sequence 5′GT/C (i.e. spliceosomal-like) but also containing all other typical features of non-canonical introns. Such introns had previously been categorized as intermediate type; however, many of these introns have only poor base-pairing potential to the characterized Euglena spliceosomal U1 snRNA sequence (Breckenridge et al. 1999), making it unclear whether these introns are in fact in a transition state between intron classes and utilize any spliceosome components.

Identification of many instances of U12-type (minor-type) spliceosomal introns residing in identical gene positions to U2-type (major-type) introns in distantly-related species has provided evidence of evolutionary conversion between spliceosomal intron classes (Burge et al. 1998; Basu et al. 2008). To date, no instance of a conserved intron position being a conventional spliceosomal intron in one Euglena species and non-conventional in another species has been identified. Milanowski et al. did however recently discover the first case of a non-conventional intron containing 5′GT and AG3′ intron terminal sequences, the best candidate so far for an intermediate intron since both splice sites match those of conventional spliceosomal introns (Milanowski et al. 2016). They also identified a recently acquired non-conventional intron in the gapC gene in Euglena agilis that contains significantly longer extended intron boundary direct repeats than had previously been observed, leading them to propose that DNA double-strand break repair processes may be involved in intron emergence/acquisition in Euglena.

Only a very limited set of genes and small number of introns have been characterized in detail in Euglena. Recent extensive mRNA transcriptome studies under different physiological stress conditions (O’Neill et al. 2015; Yoshida et al. 2016; Ferreira et al. 2007) and future determination of more complete genome sequences from different Euglena species should permit a much more extensive analysis of intron evolution in euglenids and the detection of intron class conversion, if it occurs. Also important will be the identification of the cellular factors required for the removal of non-conventional introns, the experimental determination of critical intron structure and sequence requirements for splicing reactions, and the further identification of conventional spliceosomal components in Euglena. snRNAs have been identified but no experimental analysis of spliceosomal proteins or snRNP complexes has yet been performed.

2 Nuclear-Encoded Cytosolic rRNA

Expression and maturation of cytoplasmic ribosomal RNA in Euglena gracilis differs dramatically from what occurs in almost all other examined eukaryotes. The most striking feature is the cytoplasmic large subunit (LSU) rRNA, which in its mature form is fragmented into 14 discrete pieces, including the 5.8S rRNA (also called LSU1) (Schnare and Gray 1990). All 14 LSU fragment species along with the encoding sequence for the intact mature 19S rRNA of the small subunit (SSU) are encoded on an 11,056 base pair extrachromosomal DNA circle that is transcribed as a contiguous large RNA (read-around transcription) by RNA polymerase I (Greenwood et al. 2001; Schnare et al. 1990). These rDNA circles number between 800 and 4000 copies per cell (Cook and Roxby 1985; Ravel-Chapuis et al. 1985; Greenwood et al. 2001) and possess a single origin of DNA replication (Ravel-Chapuis 1988). The 19S, 5.8S (LSU1), and other 13 LSU rRNA fragments are separated by internal transcribed spacer (ITS) sequences ranging in size from 10 to 1188 base pairs in length, while LSU14 and the 19S SSU rRNA sequence are separated by an intergenic spacer (IGS) of 1743 base pairs (Greenwood et al. 2001). The spacer regions are removed post-transcriptionally, producing a number of processing intermediates (Schnare et al. 1990; Greenwood and Gray 1998). Despite detection of these intermediate processing steps, very little is known about the mechanisms and components responsible for processing and maturation of the initial single transcript into each final rRNA species. Even the nearly universally conserved rRNA processing RNase MRP complex remains uncharacterized (and undetected) in Euglena (López et al. 2009). Ribosome assembly in E. gracilis is almost certainly highly complex and likely requires a number of novel processing components. A better understanding of E. gracilis ribosome assembly may shed light on how evolutionary processes have shaped the development of such a fragmented ribosome structure and perhaps even reveal insights about steps in more canonical eukaryotic ribosome assembly pathways. The only RNA species of the cytoplasmic ribosome not found on the rDNA circle is the 5S rRNA (Schnare et al. 1990), which is instead typically genomically-encoded within 600 base pair long tandem repeats with spliced-leader (SL) RNAs, at an estimated copy number of 300 repeated units per haploid genome (Keller et al. 1992). Evidence also suggests single copy 5S and SL genes are present; however, these appear to be less conserved.
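To give a sense of scale, the copy numbers quoted above can be converted into total DNA content. This is a back-of-the-envelope sketch using only the figures given in the text; the constant and function names are chosen here for illustration, and the totals are simple products, not measured values.

```python
# Figures quoted in the text above.
RDNA_CIRCLE_BP = 11_056                  # length of one extrachromosomal rDNA circle
RDNA_COPIES_PER_CELL = (800, 4000)       # reported lower and upper copy numbers
FIVE_S_SL_REPEAT_BP = 600                # length of one 5S/SL tandem repeat unit
FIVE_S_SL_REPEATS_PER_HAPLOID = 300      # estimated repeat units per haploid genome


def total_bp(unit_length_bp: int, copies: int) -> int:
    """Total base pairs contributed by `copies` identical units."""
    return unit_length_bp * copies


low, high = (total_bp(RDNA_CIRCLE_BP, n) for n in RDNA_COPIES_PER_CELL)
five_s_sl = total_bp(FIVE_S_SL_REPEAT_BP, FIVE_S_SL_REPEATS_PER_HAPLOID)

print(f"rDNA circles: {low / 1e6:.1f} to {high / 1e6:.1f} Mbp per cell")
print(f"5S/SL tandem repeats: {five_s_sl / 1e3:.0f} kbp per haploid genome")
```

On these figures the circles account for roughly 8.8 to 44.2 Mbp of DNA per cell, against about 180 kbp for the 5S/SL tandem repeats per haploid genome.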

3 Euglena snoRNAs and Their Expression

The E. gracilis rRNA has the largest number of modified nucleotide positions of any rRNA examined to date. The SSU and LSU rRNAs contain 88 and 262 identified modifications respectively (Schnare and Gray 2011). Therefore, there is a significant increase in the density of modifications in the fragmented large subunit (LSU) rRNA species in E. gracilis relative to the non-fragmented SSU rRNA, suggesting that the additional modifications may have an important structural stabilizing role and/or function in the more complicated ribosome biogenesis pathway in this organism. The majority of these modifications are 2′-O-methylations (Nm) (209) and pseudouridines (Ψ) (119), contradicting the usual trend of multicellular organismal rRNA being more heavily modified than that of simpler organisms. In addition to having conserved modifications at many positions also modified in other eukaryotes, E. gracilis appears to contain a large number of species-specific and euglenozoan-specific modifications (Schnare and Gray 2011; Eliaz et al. 2015).

In eukaryotes, the two most prevalent modifications in rRNA are isomerization of uridine to Ψ and 2′-O-methylation (Li et al. 2016; Sharma and Lafontaine 2015). Most of these modifications are targeted by small guide RNAs called small nucleolar (sno) RNAs. SnoRNAs targeting Nm sites are called C/D box snoRNAs while those that target sites of Ψ formation are called H/ACA box snoRNAs, with both classes defined by conserved sequence and structural features (Bratkovič and Rogelj 2014; Lui and Lowe 2013). Since E. gracilis has so many modifications, the initial prediction was that it would also require a large collection of snoRNAs to specify all these modified sites. Identification of E. gracilis snoRNAs through biochemical, genomic amplification (PCR) strategies and bioinformatic analysis has revealed that this is indeed the case (Moore and Russell 2012; Russell et al. 2004, 2006). Not only are there a large number of different snoRNA species but also a very large collection of sequence-related isoforms of each species, the full extent of which has yet to be determined.

Elucidation of the organization of snoRNA genes in E. gracilis has revealed that these genes are usually tandemly repeated in the genome with genes for the two classes of snoRNAs interspersed (Moore and Russell 2012). This organization pattern is similar to what has been observed in several trypanosome species and various plant species (Barneche et al. 2001; Brown et al. 2003; Liang et al. 2005). The modified sites in E. gracilis rRNA are not evenly dispersed along the lengths of the rRNAs, but rather typically clustered and sometimes densely clustered, such as a region in LSU species 6 where in a stretch of 22 nucleotides nearly half are Nm (2′-O-methylated) (Schnare and Gray 2011). This modification pattern is related to the organization of snoRNA genes. We have identified several instances where adjacent or nearby genes encode snoRNA species that target adjacent rRNA modification sites (Moore and Russell 2012). How did such a situation arise? Many Euglena snoRNAs are encoded by tandemly repeated genes, and when sequence divergence in a paralogous gene copy alters the guide region of a snoRNA, new base-pairing potential emerges to target a new modification site; that is, a new snoRNA species has been created. We have documented several cases where small insertions/deletions have occurred in nearby snoRNA gene copies that allow targeting of adjacent rRNA modification sites (Moore and Russell 2012). It seems that the apparent sequence repetitiveness in the E. gracilis genome, and the unexplained propensity to create gene copies, has been a driving factor in the creation of the large collection of snoRNA species and modification sites in this organism. However, what is not so clear is why this is selectively affecting modification of the various LSU rRNA species more than the SSU rRNA. Perhaps E. gracilis rapidly gains and then loses new snoRNA species through this genomic amplification mechanism, but there is stronger selective pressure to retain snoRNAs targeting LSU fragment species as this is more beneficial for ribosome function in this organism. Also intriguing to consider is whether the fragmented nature of the E. gracilis rRNA initially necessitated a mechanism to rapidly create new snoRNA isoforms (snoRNAs targeting the same site) and species (those targeting different sites), or vice versa: fragmentation emerged because it could be tolerated in a cellular environment containing an unusually large number of snoRNAs with largely redundant functions.

Most of the E. gracilis snoRNA genes are expressed initially as polycistronic precursor transcripts of unknown lengths (we have detected transcripts upwards of 800 nts), containing several individual snoRNA sequences that are then processed into individual snoRNA species (Moore and Russell 2012). They are assembled with conserved core protein binding partners by an undefined processing and assembly mechanism in Euglena. Polycistronic transcripts containing both snoRNA classes have been detected. Transcription initiation and termination elements for expression of these genomic snoRNA clusters have yet to be determined; however, some of the spacer regions between mature snoRNA sequences display significant structural potential that may play a role in the expression mechanism (Moore and Russell, unpublished results). Not all E. gracilis snoRNAs are expressed polycistronically as the U3 snoRNA, a snoRNA that functions in pre-rRNA processing steps (i.e. specifying rRNA cleavage sites instead of targeting modification sites) appears to be expressed monocistronically (Greenwood et al. 1996; Charette and Gray 2009). Although the U3 snoRNA genes are multi-copy and frequently found associated with either U5 snRNA or tRNA genes, the U3 genes are in the opposite transcriptional orientation to the nearby U5 or tRNA genes (Charette and Gray 2009). Unlike U3, two other predicted E. gracilis processing snoRNAs, U14 and the Eg-h1 H/ACA-like RNA, are instead encoded by closely-spaced tandemly repeated genes like the modification-guide snoRNAs and are likely polycistronically expressed (Moore and Russell 2012). Therefore, there is no simple relationship between snoRNA function and expression mode in E. gracilis.

Currently, it is not definitively known which RNA polymerases are being used to express different snoRNA species in E. gracilis. In trypanosomatids and plants, U3 snoRNA genes are transcribed by RNA polymerase III (Fantoni et al. 1994; Kiss et al. 1991; Marshallsay et al. 1992), and the close linkage of some E. gracilis U3 genes with tRNAs suggests that at least these gene copies may be transcribed by this RNA polymerase. However, in trimethylguanosine cap pull-down RNA libraries we have found an abundance of E. gracilis U3 sequences consistent with these U3 species being transcribed by RNA polymerase II (Moore and Russell, unpublished results). Since not all E. gracilis U3 genes are linked with tRNA genes, it is possible that both RNA polymerases may be involved in U3 snoRNA expression depending on genomic context of individual U3 genes. The frequent expression of E. gracilis modification guide snoRNAs as polycistronic transcripts and relative transcript size is more consistent with RNA polymerase II transcriptional properties.

4 Euglena Chloroplast RNAs and Processing

Most recently, much of what has been deduced about Euglena chloroplast genome RNA-coding capacity has been through the determination of complete chloroplast genome structures from a collection of representative species from the Euglenaceae (Hrdá et al. 2012; Wiegert et al. 2012; Dabbagh and Preisfeld 2017; Bennett and Triemer 2015) and comparison to the much earlier determined chloroplast genome structure of Euglena gracilis Strain Z (Hallick et al. 1993). An examination of transcription patterns of the 96 genes contained on the E. gracilis plastid genome under different physiological states and stress conditions has also been performed (Geimer et al. 2009). Chloroplast RNA processing information has been derived primarily from Richard Hallick’s group. They identified and then examined splicing patterns of a large collection of chloroplast introns and investigated expression modes for rRNA and tRNA, and the chloroplast RNA polymerase activities required for their expression. Identification of any other chloroplast non-coding RNAs, and protein or ribonucleoprotein complexes involved in chloroplast RNA maturation, will require future biochemical studies and other types of analyses.

4.1 Chloroplast rRNA and tRNA

In the two examined strains of E. gracilis, chloroplast rRNA is encoded in operons approximately 6000 nt in length. The operon codes for 16S, 23S, and 5S rRNA genes separated by internal transcribed spacers, some of which contain tRNA genes or pseudogenes, an overall arrangement similar to many bacterial rRNA operons. The operon structure is tandemly repeated three times. A fourth partial repeat, containing only a complete 16S rRNA sequence and an additional open reading frame (ORF), is found in Strain Z but was not confirmed in var. bacillaris (Hallick et al. 1993; Bennett and Triemer 2015). These operons make up 13.7% of the length of the genome.

There are a total of 27 tRNAs (not including the pseudogenes) found in Strain Z which are actively expressed (Hallick et al. 1993). An additional 9 pseudo-tRNAs which do not appear to be transcribed are found in regions within the rRNA operon repeats. The bacillaris strain possesses 31 actively transcribed tRNA genes, with only 4 pseudogenes (Bennett and Triemer 2015). trnI-trnA genes are co-transcribed with the rRNA operons and are the only chloroplast tRNAs that are multicopy. Most of the tRNA genes reside in clusters with short spacers, sometimes closely-linked with protein-coding genes.

There are at least two different RNA polymerase activities in E. gracilis chloroplasts that can be biochemically separated and are active when used in in vitro transcription assays (Greenberg et al. 1984). They display differences in enzymatic properties including salt concentration tolerance, optimum Mg2+ concentrations and temperature activity profiles. The RNA polymerase activity that remains tightly associated with chloroplast genomic DNA has been shown to selectively transcribe the rRNA operons (Greenberg et al. 1984). The soluble RNA polymerase activity transcribes most of the chloroplast tRNAs excluding those that are contained within the rRNA operons. Specificity of these RNA polymerase activities for transcribing the various protein-coding genes has not been extensively examined.

Polycistronic transcription and subsequent processing of these extended transcripts appears to be a prevalent mode of gene expression in E. gracilis chloroplasts for transcripts produced by either RNA polymerase activity (Christopher and Hallick 1990; Greenberg and Hallick 1986). Greenberg and Hallick were first able to isolate E. gracilis soluble chloroplast extracts that were capable of transcribing polycistronic transcripts containing multiple tRNA species and that also accurately processed these primary transcripts to generate mature tRNA 5′ and 3′ termini (accurate CCA 3′ end addition was not verified in this study) (Greenberg and Hallick 1986). Either chloroplast DNA or cloned tRNA genes served as appropriate transcription and subsequent processing substrates for the soluble extracts. Christopher and Hallick then demonstrated that polycistronic transcription also occurs for chloroplast ribosomal protein genes, where one transcription unit was characterized that contains 11 ribosomal protein genes, an isoleucine tRNA gene, and an ORF of unknown function (Christopher and Hallick 1990). This transcription unit is also predicted to contain at least 15 introns, making it a large polycistronic transcription unit and complex gene expression pathway. It appears that the tRNA is processed and matured from this large transcript, as opposed to alternative individual transcription of the tRNA as a nested transcription unit, since the spacers flanking the mature sequence are short and do not appear to contain obvious promoter or termination elements. The authors noticed that the codon that would be deciphered by this particular isoleucine tRNA isoacceptor is enriched in mRNAs coding for constitutively expressed proteins (such as ribosomal proteins) relative to the codon’s frequency in mRNAs for light-induced proteins. They speculate this may be the reason for this tRNA residing in this particular polycistronic unit. Through detection of RNA processing intermediates and products via nucleic acid hybridization experiments, it appears that RNA endonucleases are utilized for liberating individual RNA species from the polycistronic transcript and also for other transcription units containing tRNA species. A prediction would be the key involvement of the tRNA 5′ end maturation endonuclease RNase P in the various polycistronic transcript processing pathways.

4.2 Chloroplast Introns

An unusual feature of the Euglena chloroplast genome structure is the very large number of introns. Surveys of Euglenaceae chloroplast genome sequences have revealed a high degree of variability in intron content (Bennett and Triemer 2015; Pombert et al. 2012). The two sequenced E. gracilis chloroplast genomes possess the greatest number of introns in this taxon, with the strain Z chloroplast containing 155 introns and var. bacillaris containing 134. This results in 66.7 and 68.3% of protein coding genes containing at least 1 intron in the two strains, respectively (Thompson et al. 1995; Bennett and Triemer 2015; Hallick et al. 1993). Curiously, despite this high intron content, none of the Euglena chloroplast tRNA genes contain introns. This differs markedly from what is found in green algae where over 50% of tRNA genes contain introns.

Chloroplast introns in E. gracilis include members of both group II (self-splicing) introns and a unique related class designated group III introns (Copertino and Hallick 1993). The E. gracilis group II introns contain most of the conserved features of this class of introns including structural domains I-VI (Fig. 8.1c), EBS-IBS pairings, and predicted ε-ε’ and γ-γ’ interactions (Copertino and Hallick 1993). These introns are however A-U rich (striking scarcity of G-C base-pairs in some cases) and show some structural “looseness” and variability relative to those introns found in more distantly related organisms. The group III introns appear to be degenerate or minimalized group II introns that contain only domain VI (predicted catalytic and branch point ‘A’ containing) and domain I, although even this latter domain can be very minimalized in some predicted group III intron structures (Fig. 8.1c). Since in vitro splicing assays have not been performed with any of these Euglena introns, it is not known which of them are in fact self-splicing. It seems probable that the group III introns (at least) may have degenerated to the point where they are now completely dependent on trans-acting protein and/or RNA splicing factors for either or both of the two transesterification reactions, assuming they use such a splicing pathway.

Euglena chloroplast group II and group III introns can be found individually or as so-called twintrons: introns interrupting introns (Hallick et al. 1993; Bennett and Triemer 2015). Twintrons have been identified containing pairs of group II or group III introns, group II interrupting group III (and vice-versa), and even arrangements containing larger numbers of nested introns than just two. Hong and Hallick (1994) identified a case of a twintron arrangement in the E. gracilis ycf8 gene where the outer intron can be a group II intron interrupted by two spaced group II introns (that is, two introns each inserted at different locations within the outer intron), or alternatively this outer intron can be classified as a group III intron interrupted by a group II intron. Alternative splicing dictates which combination of introns is removed, and if the group II + III intron combination is removed, this pathway prevents removal of the outer group II intron by truncating several key structural regions.

The strict definition of a twintron, as defined for example by Hafez and Hausner (2015), is an embedded arrangement where the inner intron must be removed first to allow formation of the correct structure that catalyzes removal of the outer intron. In many of the Euglena twintron arrangements the insertion site of the inner intron is in domain V or VI of a group II intron, insertion positions that would be predicted to disrupt the tertiary structure required for outer intron removal in other well-studied group II introns. However, these Euglena group II introns already show some structural differences and flexibility relative to those studied in other organisms and, together with the existence of the structurally minimalized group III introns, it may be premature to assume strict adherence to an ordered splicing pathway for all Euglena twintron arrangements. The frequency of twintrons in Euglena chloroplast genomes and the overall large number of introns suggest that intron mobility and insertion into new genomic sites is a relatively common occurrence in E. gracilis and more prevalent than is seen in other euglenids (Thompson et al. 1997; Pombert et al. 2012); many of these introns appear to be unique to E. gracilis. Through recent determination of chloroplast intron structure and location in Monomorphina aenigmatica, a species occupying an intermediate branching position in euglenids, Pombert et al. (2012) have provided further evidence that group II/III intron abundance in Euglena gracilis appears to have resulted from more “recent” proliferation events, including the establishment of twintron arrangements (Hrdá et al. 2012; Wiegert et al. 2012). They found cases of intermediate stages of intron evolution in which M. aenigmatica contains a single group II intron (i.e. no twintron arrangement) inserted at the same gene position as the outer intron of a twintron arrangement in Euglena gracilis. The maintenance of twintron arrangements is the strongest argument so far for ordered splicing pathways; that is, insertion into a site that disrupts splicing of the outer intron requires first removing the inner intron to prevent gene function inactivation that would otherwise be the result of the insertion event.
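The ordered pathway implied by the strict twintron definition can be expressed as a simple rule: any intron nested inside another must be excised before the intron that contains it. The sketch below illustrates only that rule, under the assumption that introns are described by coordinates on the unspliced precursor; the `Intron` class, the function names and the coordinates are hypothetical, and, as noted above, it is not established that every Euglena twintron follows this order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intron:
    name: str
    start: int  # 0-based start on the unspliced precursor (inclusive)
    end: int    # end on the unspliced precursor (exclusive)


def is_nested(inner: Intron, outer: Intron) -> bool:
    """True if `inner` lies entirely within a different intron `outer`."""
    return inner != outer and outer.start <= inner.start and inner.end <= outer.end


def strict_splicing_order(introns: list[Intron]) -> list[str]:
    """Order introns so that every inner intron precedes its outer intron.

    Nesting depth = number of other introns that contain this one;
    deeper introns are removed first. sorted() is stable, so introns
    at the same depth keep their input order.
    """
    def depth(intron: Intron) -> int:
        return sum(is_nested(intron, other) for other in introns)

    return [i.name for i in sorted(introns, key=lambda i: -depth(i))]


# Toy arrangement: one outer intron interrupted by two separate inner introns.
twintron = [
    Intron("outer", 0, 100),
    Intron("inner_a", 10, 20),
    Intron("inner_b", 50, 60),
]
print(strict_splicing_order(twintron))
```

Alternative pathways, such as the two interpretations of the ycf8 arrangement described earlier, would correspond to different sets of intron coordinates passed to the same function.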

It is curious that both the E. gracilis nuclear and chloroplast genomes are so intron-rich and also contain intron classes not known to exist outside of euglenids. We may then speculate about whether there is an evolutionary relationship between the non-conventional nuclear introns and the chloroplast group III introns, both of which maintain few conserved intron structural features for their respective splicing mechanisms. Were the non-canonical introns the end result of a large scale invasion event of the nuclear genome by group III mobility elements derived from an ancestral euglenid chloroplast? A detailed understanding of the splicing mechanisms and components involved for removal of these different intron types, and a large-scale analysis of introns in E. gracilis and other euglenids may reveal new insights into intron evolution in eukaryotes and the importance of these various intron classes in regulating gene expression in these organisms.

Perhaps the most surprising feature of gene expression in E. gracilis chloroplasts is the fact that there appears to be little differential variation in RNA species level when cells are examined at different stages of development and/or subjected to various stress-inducing agents (Geimer et al. 2009). This is somewhat unexpected considering the complexity of processing required to remove the large number of introns in precursor chloroplast transcripts and the unusual adaptability of this organism in general to adjust to a wide range of environmental fluctuations. It was observed however that there can be significant changes to global chloroplast RNA levels under these various tested conditions. Such observations may indicate that if differential changes are occurring at the proteome level in E. gracilis chloroplasts, the regulation may be occurring at the translational control level.

5 Mitochondrial Genome Structure, Expression, and RNA Editing

RNA processing in Euglenozoan mitochondria has been shown to be both mechanistically unique and amazingly diverse compared to other eukaryotic phyla. The three major groups within Euglenozoa (euglenids, kinetoplastids, and diplonemids) show a broad range in mitochondrial chromosome structure, gene expression strategy, and RNA processing mechanisms. Comparatively little is currently known about euglenid mitochondria; in particular, until recently virtually nothing was reported about E. gracilis mtDNA structure. It now appears that there are significant differences in E. gracilis compared to mitochondria in the other Euglenozoan taxa. An understanding of these other Euglenozoans may then provide evolutionary insight into mitochondrial features in this phylum. Further analysis of mitochondrial DNA and RNA features in E. gracilis itself and other euglenids will be indispensable in understanding RNA maturation and genome structure in these species. Here, we put current knowledge of E. gracilis mitochondrial chromosome structure, RNA expression and processing, in the broader context of Euglenozoans collectively.

Diplonemid mitochondrial DNA is arranged into two classes of small circular chromosomes of different sizes, Class A (6 kbp) and Class B (7 kbp) (Marande et al. 2005). mRNAs in diplonemid mitochondria are not expressed as single contiguous transcripts but rather as short fragments (known as modules) of several hundred nucleotides (Kiethega et al. 2011; Vlcek et al. 2010; Marande and Burger 2007). Each module is encoded on a different chromosome that carries only that gene fragment. Following expression, the module transcripts require processing through endonucleolytic cleavage, polyadenylation of the 3′ module, and trans-splicing in order to form mature full-length transcripts (Kiethega et al. 2013). The mechanism through which this trans-splicing occurs is not yet understood, though it has been proposed that small guide RNAs may help to facilitate this process (Kiethega et al. 2013; Moreira et al. 2016). Additional editing of modules may also occur, including addition of short uridine stretches (1–3 nucleotides) to module ends, as well as both C-to-U and A-to-I editing (Moreira et al. 2016).

In the second major Euglenozoan group, the kinetoplastids, mitochondrial DNA (termed kinetoplast DNA or kDNA) is also arranged into two classes of circular chromosomes. In contrast to diplonemids, kinetoplast chromosomes differ quite significantly in size and are classified as either large (maxicircles) or small (minicircles) (Riou and Delain 1969; Kleisen et al. 1976; Steinert and Van Assell 1975). Maxicircle copy number ranges from 25 to 50 copies per cell in the species examined, while thousands of minicircles can be present. Kinetoplastid maxicircle chromosomes primarily carry the mitochondrial protein-coding and rRNA genes (Eperon et al. 1983; Westenberger et al. 2006; Simpson et al. 1987). Minicircles code for small guide RNAs (gRNAs) (Pollard et al. 1990; Corell et al. 1993; Jasmer and Stuart 1986a, b; Deschamps et al. 2011) which form ribonucleoprotein complexes called editosomes that carry out a unique form of uridine insertion/deletion (U indel) editing of mRNA. This form of U indel editing has made gene identification difficult, as the gene sequence may have little resemblance to the mature edited mRNA; up to 553 insertion and 89 deletion sites have been characterized for a single transcript (Koslowsky et al. 1990).
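To illustrate why U indel editing can leave a gene with little resemblance to its mature mRNA, the following is a minimal toy sketch in Python. The function name `apply_u_indel_edits`, the edit-tuple format, and the example sequence are all hypothetical and chosen only for illustration; the sketch does not model gRNA pairing or the editosome, and it uses no real kinetoplastid data.

```python
def apply_u_indel_edits(pre_edited, edits):
    """Apply uridine insertions/deletions to a pre-edited RNA string.

    edits: list of (position, n) tuples in pre-edited (0-based) coordinates.
    n > 0 inserts n U residues before `position`;
    n < 0 deletes |n| U residues starting at `position`.
    """
    rna = list(pre_edited)
    # Work from the 3' end so that upstream coordinates remain valid.
    for pos, n in sorted(edits, reverse=True):
        if n > 0:
            rna[pos:pos] = "U" * n
        elif n < 0:
            segment = rna[pos:pos - n]
            # Only uridines may be removed by this type of editing.
            if segment != ["U"] * (-n):
                raise ValueError(f"cannot delete non-U residues at {pos}")
            del rna[pos:pos - n]
    return "".join(rna)


# Hypothetical pre-edited transcript and edit sites (illustrative only).
pre_edited = "AGGAUUUGCA"
edits = [(2, 3), (4, -2)]  # insert 3 U at position 2, delete 2 U at position 4
edited = apply_u_indel_edits(pre_edited, edits)
print(pre_edited, "->", edited)
```

Even with two edit sites, the edited sequence no longer aligns base for base with its template, which hints at how hundreds of such sites could obscure the relationship between gene and mRNA.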

The Euglena gracilis mitochondrial genome is also atypical but appears to be quite different from those of other Euglenozoans. Rather than circular chromosomes as seen in the diplonemids and kinetoplastids, E. gracilis possesses a collection of heterogeneous linear chromosomes ranging in size from a distribution peak at 4 kbp up to 8 kbp (Spencer and Gray 2011; Dobáková et al. 2015). Only seven protein-coding genes (cox1, cox2, cox3, cob, nad1, nad4, and nad5) have been identified in the genome (Dobáková et al. 2015; Tessier et al. 1997; Yasuhira and Simpson 1997). This is predicted to be the full complement of protein-coding genes in the mtDNA, with the remaining proteins likely encoded in the nuclear genome. Comparison of the gene sequence and corresponding mRNA for these genes shows no evidence that editing or splicing is required for the formation of mature transcripts (Dobáková et al. 2015; Spencer and Gray 2011). This is quite surprising, as unique and extensive mRNA editing appears to be a core feature of RNA maturation in the mitochondria of many other Euglenozoans. A second surprising feature of E. gracilis mtDNA is that in addition to full-length versions of mitochondrial genes, there are also many small mRNA and rRNA gene fragments scattered throughout the genome (Spencer and Gray 2011). These fragments retain high sequence identity to segments of the full-length genes, in some cases even being perfect matches, but do not appear to be expressed. These small fragments and the presence of many short direct repeats have been proposed as possible evolutionary predecessors to the minicircle-encoded gRNAs of kinetoplastids, possibly produced through recombination between flanking repeats to produce “guide-like recombination products” (Spencer and Gray 2011). Transcription of the complementary strand of the gene fragments could then result in anti-sense RNAs capable of base-pairing to mRNAs, potentially allowing sequence drift in protein-coding regions that could be corrected by RNA editing.

The mitochondrial genomes of two other euglenids, Peranema trichophorum and Petalomonas cantuscygni, have been examined using electron microscopy (Roy et al. 2007). These results show that the P. trichophorum genome consists of many linear DNA molecules ranging from 1 to 75 kbp in size. In contrast, P. cantuscygni possesses linear 40 kbp molecules, with a small number of circular 40 kbp and much smaller 1–2.5 kbp molecules. More comprehensive examination of mitochondrial genome structure and content in other euglenids will indicate whether linear chromosomes are the predominant form and whether RNA editing of the kind found in kinetoplastids is present in any euglenids.

The diversity found in genome structure and transcript processing in Euglenozoan mitochondria raises many questions about the evolutionary history that gave rise to these various states. Flegontov et al. have suggested that the genome of the Euglenozoan last common ancestor (ELCA) was likely circular and that the diversity found in this phylum may have arisen through constructive neutral evolution (Flegontov et al. 2011). It will be important to examine more representatives of all three major groups to determine the extent of possible genome types and novel mechanisms for RNA processing in these organelles.

5.1 Mitochondrial Ribosomal RNA

Ribosomal RNA structure and processing in Euglenozoan mitochondria are also highly variable. The mitochondrial SSU and LSU rRNAs from E. gracilis have been identified and each appears to be expressed as two separate RNAs, termed SSU-R/SSU-L and LSU-R/LSU-L (Spencer and Gray 2011). Both SSU rRNA fragments have been sequenced and found to be encoded by chromosomally unlinked, independently transcribed genes, rather than being products of cleavage of a single contiguous pre-SSU rRNA transcript. Extensive analysis failed to detect any full-length mature SSU rRNA, providing strong evidence that these bipartite RNAs represent the mature, fragmented, functional form of this rRNA and are not further processed through a trans-splicing pathway to form a single contiguous SSU rRNA. The 3′ end of SSU-R shows little heterogeneity. The SSU-L consists of three variants containing between 1 and 3 terminal A’s at its 3′ end. Two LSU fragments have also been identified and found to have discrete 3′ ends; however, full-length genomic encoding regions could not be located for either fragment. While it is likely that these represent the functional mitochondrial LSU rRNAs, further analysis will be needed to determine whether, like the SSU, each fragment is encoded individually and contiguously in the genome. Evidence has also been found that both the LSU and SSU contain modified nucleosides, including two tandem N6,N6-dimethyladenosines and an N4-methylcytidine in the SSU and a Ψ in the LSU (Spencer and Gray 2011). Structural modeling has been performed for both the SSU and LSU fragments. The SSU fragments were found to form conserved long-range base-pairing interactions, resulting in a secondary structure with features similar to those of the eubacterial 16S SSU rRNA. The first several hundred nucleotides at the 5′ end of the SSU show the greatest divergence in structure as a result of a high A+T content. LSU terminal regions also showed great similarity to the eubacterial 23S LSU rRNA. In comparison, kinetoplast rRNA secondary structure has been found to be even more divergent from the eubacterial rRNA structure and in fact shows relatively little structural similarity to the E. gracilis rRNA.

Fragmented mt-rRNA has also been identified in the diplonemid Diplonema papillatum. As in E. gracilis, two LSU fragments (534 and 352 nt) are present, encoded on two Class B chromosomes (Valach et al. 2014). These RNAs are trans-spliced to produce a single LSU rRNA of approximately 900 nt and appear to go through additional processing steps. The 3′ fragment contains a poly-A tail that is neither present in the mature spliced transcript nor encoded in the gene sequence. The presence of this transient poly-A stretch raises questions about possible extended poly-A tail processing intermediates for the E. gracilis SSU-L, considering the observed variable 3′ ends (see above). In D. papillatum, a 26-nucleotide poly-U stretch separates the 5′ and 3′ portions of the mature spliced LSU. This stretch is not encoded in the genes for either LSU fragment, indicating that a process related to uridine insertion into mRNA modules can also act on diplonemid rRNA. A short 366 nt RNA has been proposed as a potential mitochondrial SSU rRNA, but it is as yet unclear whether this represents the entire SSU rRNA or an individual fragment (Moreira et al. 2016). Small rRNAs are not unheard of in Euglenozoans. The kinetoplastid species Trypanosoma brucei (Sloof et al. 1985; Eperon et al. 1983), Leishmania tarentolae (de la Cruz et al. 1985a, b), and Crithidia fasciculata (Sloof et al. 1985) possess the smallest mitochondrial rRNAs identified to date, composed of a 9S SSU (approximately 611–640 nt) and a 12S LSU (approximately 1141–1230 nt), each expressed as a contiguous transcript from a single gene. It will therefore be important to examine the SSU in D. papillatum and determine whether other fragments are required or whether the single SSU rRNA represents a potentially minimal rRNA.
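As a quick consistency check on the D. papillatum LSU sizes quoted above, the arithmetic can be written out in Python. This is only a sketch: it assumes that the two reported fragment lengths exclude the transient poly-A tail and that no residues are trimmed during trans-splicing, neither of which is stated in the text, and the variable names are illustrative.

```python
# Fragment lengths (nt) reported in the text for the D. papillatum mt-LSU rRNA.
lsu_fragment_lengths_nt = (534, 352)
# Non-encoded poly-U stretch (nt) separating the 5' and 3' portions.
poly_u_insert_nt = 26

# Assumption: no trimming, and the transient poly-A tail is not counted.
expected_mature_lsu_nt = sum(lsu_fragment_lengths_nt) + poly_u_insert_nt
print(expected_mature_lsu_nt)  # 912
```

Under these assumptions the sum is 912 nt, in line with the "approximately 900 nt" given for the mature trans-spliced LSU rRNA.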

In summary, while recent studies have begun to identify key features of the E. gracilis mitochondrial genome, our current knowledge of its structure and expression still lags somewhat behind what has been elucidated for other Euglenozoans. Continued efforts to characterize Euglena RNAs will be required both to further investigate the possibility of unique processing mechanisms and to define the full complement of mtDNA-encoded genes, including those for the LSU rRNA fragments.

6 Spliced-Leader RNA

Spliced-leader trans-splicing is a process through which a short RNA sequence (called the spliced-leader exon) is added to form the 5′ end of nuclear pre-mRNAs in a spliceosome-dependent manner. A small non-coding RNA termed the spliced-leader (SL) RNA acts as the donor of the short sequence. It is composed of two regions: the spliced-leader exon at its 5′ end followed by an extended sequence termed the spliced-leader intron (Fig. 8.2a), which is not included in the mature mRNA but is important for forming interactions with the target mRNA. SL RNAs fold into stem-loop secondary structures and contain an internal Sm-protein binding site, similar to what is observed in several of the small nuclear RNAs (snRNAs) of the spliceosome. The 5′ splice site required for the splicing reaction is part of the SL RNA, while the branch point adenosine, polypyrimidine tract, and 3′ splice site are located at the 5′ end of the precursor mRNA in a region collectively referred to as the “outron” (Fig. 8.2b, c). Together with the spliceosome components, these elements form a substrate competent for splicing.
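The arrangement of elements described above can be sketched as a toy string operation in Python. Everything here is a simplifying assumption made for illustration: the function name `trans_splice`, the rule that the 3′ splice site is the first AG directly following a run of at least six pyrimidines, and the short made-up sequences (which are not Euglena sequences). The sketch ignores the branch point, the Sm site, and all spliceosome components.

```python
import re


def trans_splice(sl_rna, sl_exon_len, pre_mrna):
    """Toy SL trans-splicing: join the SL exon to the mRNA body.

    Assumption: the 3' splice site is the first AG that directly follows
    a run of at least six pyrimidines (C/U) in the pre-mRNA.
    Returns (mature_mrna, sl_intron, outron).
    """
    sl_exon, sl_intron = sl_rna[:sl_exon_len], sl_rna[sl_exon_len:]
    match = re.search(r"[CU]{6,}AG", pre_mrna)
    if match is None:
        raise ValueError("no polypyrimidine tract followed by AG was found")
    acceptor_end = match.end()          # first nucleotide of the mRNA body
    outron = pre_mrna[:acceptor_end]    # removed 5' region of the pre-mRNA
    body = pre_mrna[acceptor_end:]
    return sl_exon + body, sl_intron, outron


# Hypothetical sequences, for illustration only.
sl_rna = "ACGUACGUAC" + "GUAUGAGAAUUUUGG"      # 10-nt exon + intron starting GU
pre_mrna = "GGACUAACUUUCUCUUCAG" + "AUGGCUAAG"  # outron + mRNA body

mature, sl_intron, outron = trans_splice(sl_rna, 10, pre_mrna)
print(mature)
print(sl_intron, outron)
```

In the cell the SL intron and outron are released together as a Y-shaped branched molecule; the sketch simply returns them as two separate strings.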

Fig. 8.2
(a) General structure of a spliced-leader RNA. Spliced-leader trans-splicing can add the capped spliced-leader exon to (b) monocistronic transcripts or (c) individual protein-coding RNAs liberated from a precursor polycistronic transcript. Both processing pathways result in (d) capped individual transcripts and excised Y-shaped introns made up of the spliced-leader intron sequence attached to the mRNA outron. ss = splice site

Spliced-leader trans-splicing was described in euglenids (Tessier et al. 1991) following its initial discovery in trypanosomatids (Boothroyd and Cross 1982; Sutton and Boothroyd 1986; Milhausen et al. 1984) and nematodes (Krause and Hirsh 1987). Addition of the spliced leader serves a number of purposes in different groups of organisms. In both C. elegans (Spieth et al. 1993) and trypanosomes (Muhich and Boothroyd 1988), addition of the spliced-leader exon acts as a mechanism for processing and capping of individual mRNAs contained within long polycistronic precursor transcripts, as well as for capping monocistronic transcripts (Johnson et al. 1987; Zorio et al. 1994) (Fig. 8.2b–d). Analysis of the E. gracilis transcriptome has estimated that approximately 56% of pre-mRNA transcripts undergo spliced-leader exon sequence addition (Yoshida et al. 2016). Little is currently known about whether Euglena mRNAs are transcribed mono- or polycistronically, and therefore the various roles of SL trans-splicing in Euglena remain to be determined.

The E. gracilis genome encodes at least six variants of a spliced-leader RNA. Each isoform is approximately 101 nucleotides in length, the first 26 nucleotides of which constitute the SL exon sequence that is added to the pre-mRNA transcript (Tessier et al. 1991). The specific type of cap structure (and extent of modification) of the Euglena spliced-leader RNA is unknown. In most organisms in which spliced-leader trans-splicing occurs, the SL RNA possesses a 2,2,7-trimethylguanosine (TMG) cap. Trypanosome SL exons possess a unique type of cap structure termed ‘cap 4’ containing extensive modifications, including 7-methylguanosine, 2′-O-methylation of the first four nucleotides, additional base methylations at the first and fourth nucleotides, and a Ψ at position 28 (Zamudio et al. 2009). The relatively close phylogenetic relationship between trypanosomes and E. gracilis suggests that a number of these modifications may also be present in Euglena. Information on Euglena SL cap structure may be lacking in part because recent studies indicate that there are two additional nucleotides at the 5′ end of the SL RNA exon beyond what was previously reported (our unpublished results). This is critical, as these nucleotides would carry most of the modified positions in these RNAs. These additional nucleotides also prompt further questions about how SL RNA is expressed in E. gracilis and whether initial processing of pre-SL RNA may occur prior to capping and splicing. We are learning an increasing amount about SL trans-splicing in E. gracilis, but the exact role and structure of this RNA require further elucidation.

7 RNase P

Efficient and accurate processing of pre-tRNA molecules from both nuclear and organellar genomes is crucial for the production of functional tRNA molecules. RNase P is a key endonucleolytic enzyme complex responsible for the maturation of the 5′ ends of tRNAs. Found in all three domains of life, RNase P most commonly functions as a ribonucleoprotein complex containing a single RNA (RNase P RNA), which is the catalytic component (Guerrier-Takada et al. 1983; Pannucci et al. 1999; Thomas et al. 2000; Kikovska et al. 2006), and a variable number of proteins depending on the species. A small number of protein-only RNase Ps (PRORP) have also been identified, primarily confined to the organelles of eukaryotes (Holzmann et al. 2008; Gobert et al. 2010). Interestingly, however, several members of the phylum Euglenozoa appear to possess only protein-only versions of RNase P. A single predicted PRORP protein has been identified in Euglena mutabilis, and many trypanosome species possess nuclear (PRORP1) and mitochondrial (PRORP2) protein-only enzymes that have been shown to accurately process 5′ tRNA ends in vitro in the absence of any additional protein or RNA factors (Lechner et al. 2015; Taschner et al. 2012). To date, no RNase P has been reported for the plastid or nuclear genomes of Euglena gracilis (Lechner et al. 2015). However, a blastp search that we performed using Trypanosoma brucei PRORP proteins as queries revealed a putative PRORP protein in the E. gracilis proteome database published by O’Neill et al. (2015) that contains both PRORP and PPR motif repeat domains like those found in other protein-only RNase Ps (our unpublished results). Whether this PRORP-like protein in E. gracilis possesses tRNA processing activity, and whether E. gracilis also possesses an RNA-dependent RNase P activity, will need to be examined. The apparent distribution of PRORP enzymes in Euglenozoa suggests that dependence on RNase P RNA may have been lost early in the evolution of this phylum.

8 Conclusions and Future Directions

RNA transcript processing in Euglena displays remarkable diversity, not only when compared to similar processes in distantly related eukaryotes but also when compared to its most closely related studied relatives, the kinetoplastids and diplonemids. While some information is now available about processing of select classes of Euglena RNA in nuclei, mitochondria, and chloroplasts, our knowledge is still limited due to a lack of characterization of RNA processing factors and an incomplete understanding of nuclear and mitochondrial genome structure. These will be key research areas to investigate in the future and should be aided by advances in proteomics and high-throughput nucleic acid sequencing technologies. Also important will be the isolation and characterization of classes of non-coding RNAs through RNA-Seq approaches, which will give a more complete picture of the abundance and diversity of non-coding RNA types in different Euglena species. So far, the key elucidated features are that polycistronic transcription is a common gene expression strategy for several classes of Euglena nuclear and chloroplast RNA (unknown for mitochondrial transcripts at this stage), that both cytoplasmic and mitochondrial rRNAs are unusually structurally fragmented (extensively so for cytosolic LSU rRNA), that novel introns are prevalent in both organelle and nuclear genes, and that non-coding RNAs and their sequence isoforms are apparently very abundant in Euglena, which seems to be related to the repetitiveness of its nuclear genome structure. What roles might these unusual RNA structural features and transcript processing mechanisms have in the environmental adaptability of Euglena? Additional surprising features and mechanisms will likely be discovered as efforts to study this fascinating genus continue, and these may provide new insights into the evolution of RNA and protein-RNA complexes in all organisms.