Introduction

RNA editing is a post-transcriptional event that alters the content of plant organellar RNAs. The specific RNA positions affected by RNA editing, along with their corresponding DNA positions, are usually known as editing sites. In angiosperms, this post-transcriptonal event substitutes cytidines (C) for uridines (U) mainly in mitochondrial transcripts, in which ca. 400 editing sites have been reported in Arabidopsis (Bentolila et al. 2013; Giegé and Brennicke 1999), in comparison to the ca. 30 editing sites found in chloroplasts (Chateigner-Boutin and Small 2007; Lutz and Maliga 2001; Fujii and Small 2011; Tillich et al. 2006). Most editing sites are located in protein-coding genes but also in non-coding regions, tRNAs, introns, or UTRs (Castandet et al. 2010; Grimes et al. 2014).

Editing sites are surrounded by a sequence motif involved in the binding of the editosome (Cummings and Myers 2004; Farré et al. 2001; Giegé and Brennicke 1999; Mulligan et al. 2007; Neuwirt et al. 2005). The editosome is composed of distinct nuclear-encoded proteins with variable specificity in editing site recognition (Bentolila et al. 2013; Takenaka et al. 2012). The PLS-class proteins (Lurin et al. 2004) belong to the family of the pentatricopeptide repeat (PPR) proteins (Aubourg et al. 2000; Small and Peeters 2000), and are involved in the recognition of specific editing sites (Barkan et al. 2012; Takenaka et al. 2013; Yagi et al. 2013b). In protein-coding genes, editing sites are predominantly found at non-synonymous positions: more frequently at second followed by first codon positions (Cuenca et al. 2010; Giegé and Brennicke 1999; Mulligan et al. 2007; Sloan et al. 2010; Picardi et al. 2010; Yura and Go 2008). RNA editing can change the genomically encoded amino acids (Covello and Gray 1989; Gualberto et al. 1989; Hiesel et al. 1989; Mower 2009). Such changes tend to increase overall amino acid conservation, modify physicochemical properties (Giegé and Brennicke 1999; Jobson and Qiu 2008; Tillich et al. 2006; Kugita et al. 2003) and even alter protein folding (Yura and Go 2008; Yura et al. 2009), which means that editing sites are essential for the proper function of proteins.

The frequency of editing sites is uneven among land plants. With no editing sites reported in green algae, more than 2000 sites have been identified in lycophytes (Hecht et al. 2011), about 500 sites in gymnosperms (Salmans et al. 2010), and between 200 and 700 sites in angiosperms (Mower 2009; Mower and Palmer 2006; Sloan et al. 2010; Richardson et al. 2013). Among angiosperms, early-diverging lineages show the highest numbers of editing sites (Rice et al. 2013; Richardson et al. 2013). In contrast to other land plants, angiosperms are undergoing an extensive loss of editing sites through the substitution of genomic editable cytidines to thymidines (Cuenca et al. 2010; Mower 2008; Shields and Wolfe 1997; Sloan et al. 2010). Retroprocessing, i.e. gene conversion with a cDNA generated by reverse transcription of an edited RNA seems to be one of the molecular mechanisms responsible for editing site loss (Bowe and dePamphilis 1996; Cuenca et al. 2010, 2016; Hecht et al. 2011; Mulligan et al. 2007; Parkinson et al. 2005; Sloan et al. 2010). However, a key role of point mutations has also been proposed for the loss of editing sites favored by natural selection (Mower 2008).

To identify editing sites, C-to-U mismatches are detected by comparing RNAs with their corresponding DNA templates. For this purpose, Sanger sequencing of cDNAs has been widely used in the last two decades (Giegé and Brennicke 1999; Gualberto et al. 1989; Takenaka and Brennicke 2003, 2007), though it is time-consuming and prone to underestimate editing sites (Mower and Palmer 2006). In recent years, next-generation sequencing of transcriptomes (RNA-seq) has been used as an alternative approach. One of the main advantages of RNA-seq based identification is that it allows a fast and large-scale recognition of editing sites over whole genomes (Cuenca et al. 2016; Fang et al. 2012; Grewe et al. 2014; Grimes et al. 2014; Islam et al. 2013; Kazakoff et al. 2012; Park et al. 2014; Picardi et al. 2010; Sahraeian et al. 2017; Shearman et al. 2014; Wu et al. 2015). In this approach, the mapping of RNA-seq reads to genomic sequences is crucial to identify editing sites and their editing levels accurately (Bahn et al. 2012; Diroma et al. 2017). This is a challenging task because stringent mapping settings may lead to false negatives (Guo et al. 2015), while more relaxed settings may increase the number of false positives. As a result, different strategies have been discussed to improve the detection of RNA editing sites (Stone and Štorchová 2015; Guo et al. 2015; Diroma et al. 2017). Nevertheless, the detection of RNA editing events through RNA-seq data has enormous potential to deepen our knowledge of the transcriptional process in plant mitochondria.

To gain a better understanding of RNA editing in angiosperm mitochondria, this study identified the editing sites of diverse angiosperm mitochondria by systematically exploiting publicly available RNA-seq data. Through an in-house computational pipeline, 10,217 C-to-U editing events were identified in all protein-coding genes of 17 mitochondrial genomes belonging to five angiosperm lineages: magnoliids, monocots, basal eudicots, rosids, and asterids. Many of these sites have not been previously reported and represent a valuable data set for the research community. We aim to: (i) analyze the distribution of editing sites across the main angiosperm lineages, and a wide range of mitochondrial genes; (ii) study the amino acid changes produced by RNA editing; (iii) examine the conservation of editing sites across angiosperm lineages; (iv) compare the editing extent at synonymous and non-synonymous editing sites; (v) evaluate the role of retroprocessing in the loss of editing sites; and (vi) study the dynamics of editing site loss and gain along the evolution of angiosperm mitochondria.

Results

Identification of C-to-U editing sites

To identify editing sites, a sample of diverse angiosperm mitochondria was analyzed. The selection of angiosperms was based on the complete mitochondrial genomes publicly available in the GenBank organellar genome databaseFootnote 1 and on the availability of transcriptomic data in the SRA (Sequence Read Archive) databaseFootnote 2 at NCBI. From those lineages with genomic and transcriptomic data available, we selected 17 diverse angiosperm species Footnote 3 (Fig. 1) so that: (i) there was sufficient transcriptomic data; (ii) they did not have odd mutation rates, such as Silene spp. (Sloan et al. 2010); and (iii) they were evenly distributed across the main angiosperm lineages. By using an in-house computational pipeline (“Identification of RNA editing sites from transcriptomic data” section), 10,217 C-to-U mismatches, or editing sites, were identified by comparing transcriptomic data with mitochondrial protein-coding sequences (ESM 1). For the comparative analyses described below, we used the 8318 editing sites found in the 24 protein-coding genes conserved across the 17 angiosperm mitochondria (ESM 2). These editing sites were annotated as uppercase C nucleotides in the DNA sequence alignments of each protein-coding gene (ESM 3).

Fig. 1
figure 1

Total number of editing sites in 24 protein-coding genes across 17 angiosperms. Stacked bars depict numbers of editing sites at first (blue), second (green), and third (red) codon positions, respectively. Solid and dashed red vertical lines represent average and standard deviation values, respectively, calculated for the angiosperm lineages asterids, rosids, and monocots

Given that for some of the studied angiosperms RNA editing sites had been previously identified by Sanger sequencing of RT-PCR products, we were able to make comparisons with our identification method based on RNA-seq data (ESM4). The editing sites we detected in Arabidopsis, Liriodendron, Nicotiana, and Oryza were contrasted to those obtained previously by Sanger sequencing of cDNAs (Giegé and Brennicke 1999; Richardson et al. 2013; Sugiyama et al. 2005; Notsu et al. 2002). On average, there was an agreement of about 82, 82, 83, and 80% in Arabidopsis, Liriodendron, Nicotiana, and Oryza, respectively (ESM 4). The majority of the disagreements represented novel editing sites (i.e., sites identified in this study), a common result observed in related studies (Grimes et al. 2014; Picardi et al. 2010). On average, about 45% of the disagreements were at third codon positions in the four species, where editing sites showed poor conservation across plant species (Bentolila et al. 2013; Guo et al. 2017; Mower and Palmer 2006; Picardi et al. 2010; Stone et al. 2017; Wu et al. 2015). The disagreements caused by the sites identified by RT-PCR but not by RNA-seq showed an average editing extent between 0 and 7.5%, indicating that they were overlooked by our identification method owing to their low editing extents, i.e. < 10%, or low read-depth, i.e. < 10 reads (“Identification of RNA editing sites from transcriptomic data” section). The 50–70% of the novel editing sites identified by RNA-seq were supported by conservation levels about 50% on average, across the 17 angiosperm mitochondrial genomes, or predictions made by PREP-Mt (Mower 2009) under a cutoff of 0.2 (ESM 4).

Identification of other types of mismatches

Following the same procedure to identify the 8318 C-to-U mismatches, a total of 599 unconventional mismatches were also recognized in the 17 × 24 protein-coding genes. Overall, C-to-U mismatches were the most frequent, representing 93% of all the mismatches found, followed by G-to-A (1.2%) and G-to-U (0.8%) (ESM5A). Almost 50 unconventional mismatches were located in exon–intron boundaries in the genomic sequences of several intron-bearing genes, and were supported by > 85% of the reads and by at least two SRA experiments. After detailed analyses, we concluded that they were likely the result of errors in the annotation of splicing sites. Other mismatches supported by 100% of the reads were likely due to polymorphisms between the individuals used for the genomic and transcriptomic analyses (ESM5B). For example, 98 of the 153 (64%) unconventional mismatches found after mapping RNA-seq data of Liriodendron chinense on the mitochondrial genome of L. tulipifera were supported by 100% of reads in at least one SRA experiment (ESM5B). The remaining mismatches could be the result of: (i) dual transcription of nuclear-encoded mitochondrial genes (or pseudogenes) and mitochondrial-encoded copies (Adams et al. 1999; Daley et al. 2002; Picardi et al. 2010); (ii) RNA sequencing errors (in particular, those mismatches found only in reads from one SRA experiment and in sites containing more than one type of mismatch) (Wu et al. 2015); or (iii) non-canonical editing processes. Indeed, a large number of mismatches (50 in total) were found in the protein-coding region of Glycine cox2 due to the documented simultaneous expression of the nuclear and mitochondrial copies of the gene cox2 in Glycine (Adams et al. 1999; Daley et al. 2002). It is possible that other nuclear-encoded copies are also being transcribed simultaneously with the mitochondrial-encoded homologs. Recently, other examples of dual transcription of homologs in the nuclear and mitochondrial genomes have been reported: rps13 in Silene vulgaris (Sloan et al. 2012), sdh4 in Salicaceae and Lupinus (Choi et al. 2006; Havird et al. 2016), rpl5 and rps19 in Poaceae (Atluri et al. 2015; Wu et al. 2017). We identified a few candidates for dual transcription (Nicotiana atp1 and matR, Daucus nad3) based on the fact that those genes showed mismatches at several positions and were supported by more than two SRA experiments, as observed for Glycine cox2. Sequence similarity searches of DNA-seq data revealed regions showing high similarity to genes of mitochondrial origin in the nuclear genome of tobacco and carrot. Interestingly, these sequences contained the mismatches identified by RNA-seq in the genes mentioned above. Further experimental data is needed to test the function of the nuclear transcripts of the mitochondrial genes. Overall, we did not find solid evidence of non-canonical editing processes.

Variability of RNA editing among plants and genes

The total number of editing sites in each plant species was calculated by adding the editing sites of the 24 protein-coding genes together. The resulting totals varied two fold across the angiosperm phylogeny (Fig. 1 and ESM 2). This variability has been reported in other studies (Mower 2008; Picardi et al. 2010). The largest numbers of editing sites were found in Liriodendron and Nelumbo, with 755 and 679 editing sites, respectively. On the opposite end, two members of the rosids, Arabidopsis and Populus, had the lowest values: 349 and 313 editing sites, respectively. The average number of editing sites among monocots, rosids, and asterids showed no significant differences among them, but they were significantly lower than those of Liriodendron and Nelumbo (Fig. 1).

The number of editing sites was compared among the 24 protein-coding genes (ESM 6). The average numbers of editing sites varied greatly among genes, as observed in previous works (Cuenca et al. 2010; Giegé and Brennicke 1999; Picardi et al. 2010). While the largest average, ~ 43 editing sites, was found in the gene nad4, the smallest average, ~ 5 editing sites, was found in atp8. By grouping the genes based on their function, cytochrome C biogenesis and respiratory complex I genes exhibited the largest average numbers of editing sites, while ribosomal subunits and respiratory complex V genes showed the lowest numbers. This is consistent with other studies in which RNA editing occurred preferentially in genes encoding membrane-bound proteins and genes under strong selection (Jobson and Qiu 2008; Mower and Palmer 2006).

The average density of editing sites, expressed as the number of editing sites per 100-bp region, was calculated individually for each of the 17 × 24 protein-coding genes (Fig. 2). In total, six genes from a number of eudicots were devoid of editing sites (Fig. 2 and ESM 2). These genes were atp9, cox1, and cox2 from Populus; cox1 from Arabidopsis; atp1 from Salvia; and atp6 from Helianthus. The lack of editing sites in those genes was also confirmed by predictions made by PREP-Mt (Mower 2009). Previous studies have also recognized the lack of editing sites in the aforementioned Arabidopsis and Salvia genes (Bentolila et al. 2008; Giegé and Brennicke 1999; Wu et al. 2017). Editing sites in Populus have not been analyzed before; however, no editing sites were identified in the gene atp9 of Salix suchowensis, which is closely related to Populus (Ye et al. 2017).

Fig. 2
figure 2

Editing site densities in 24 protein-coding genes of 17 angiosperms. The numbers of editing sites per 100-bp (editing site densities) are associated to a color gradient defined from the maximum and minimum editing site densities in each column. Higher densities are depicted by darker colors, and lower ones with lighter colors. Average editing site densities and their variances are shown in the bottom rows, respectively. The rightmost column depicts the average value µ calculated from all the editing site densities within each angiosperm lineage

The analysis of the editing site densities of some genes showed remarkable variations across the 17 angiosperms (Fig. 2). For example, Liriodendron had 3.7 editing sites per 100-bp in atp6, while other angiosperms had almost none. Another stark disparity was observed between Liriodendron and Populus nad3, with 5.8 and 0.2 editing site densities, respectively. This variability was systematically analyzed by calculating the average and variance values of the editing site densities in each gene (bottom row in Fig. 2). The highest average densities were observed in cytochrome C biogenesis and Complex I genes. In particular, the gene ccmB had the highest average density value, while the gene atp1 had the smallest. These results agree with the total numbers of editing sites described above.

RNA editing at each codon position

Editing sites also showed an uneven distribution among codon positions (ESM2). Globally, 32, 56, and 12% of the 8318 identified editing sites were found at first, second, and third codon positions, respectively, in agreement with previous studies (Cuenca et al. 2010; Giegé and Brennicke 1999; Mulligan et al. 2007; Sloan et al. 2010; Picardi et al. 2010). This uneven distribution among codon positions was also observed in each angiosperm mitochondria (stacked bars in Fig. 1), and in each gene across the phylogeny with a few exceptions (e.g., the genes atp8 and cob in ESM 6). Nevertheless, as another study has observed (Cuenca et al. 2010), unexpected codon position distributions were found when the analysis was carried out in each gene of each plant species individually (ESM 7). In some cases, all editing sites were found at a single codon position (dashes in ESM 7).

Because editing sites represent a sample of the cytidines in a gene, the proportion of editing sites per codon position may be simply the result of how cytidines are distributed among codon positions. We compared the distributions of editing sites and cytidines among first and second codon positions for each gene of each plant (chi-squared test, two degrees of freedom, and significance level of 0.05). Only 70 out of 17 × 24 genes, i.e. ~ 15% of genes, were distributed significantly different with respect to cytidines (dots in ESM 7) suggesting a selective advantage given by a greater increase in hydrophobicity with editing at second codon position (Cuenca et al. 2010; Jobson and Qiu 2008). In contrast, in about 85% of the studied genes both distributions were statistically similar and the larger proportion of editing sites at second than first codon positions could be the result of random sampling from a biased distribution of cytidines between codon positions.

To study the physicochemical modifications generated by RNA editing at the protein level, amino acid changes produced by the identified editing sites were analyzed (“Amino acid changes produced by RNA editing” section). Our results are consistent with previous studies (Giegé and Brennicke 1999; Jobson and Qiu 2008; Tillich et al. 2006; Kugita et al. 2003). The majority (~ 85%) of the editing events resulted in non-synonymous codon changes (ESM 8) and mainly involved three amino acid changes: Pro → Leu, Ser → Leu, and Ser → Phe. In addition, based on the physicochemical properties of amino acids (Kyte and Doolittle 1982), codon alterations were classified according to the hydrophobicity of the resulting amino acids (ESM8). About ~ 55% of the amino acid changes were hydrophilic → hydrophobic produced by editing sites mainly at second codon positions. As shown in previous studies, RNA editing causes an overall increase in hydrophobicity of the resulting proteins (Giegé and Brennicke 1999; Mower 2008; Picardi et al. 2010; Yura and Go 2008; Yura et al. 2009).

In a few cases, RNA editing at first codon positions led to the creation of stop codons, affecting 19 CAA, 25 CGA, and three CAG codons (ESM1). About 74% of such edits were required to form proteins of a conserved length across angiosperms and were edited efficiently (i.e. most reads supported the editing event). For example, in the gene atp6 of Malus, a premature stop codon is generated through an editing site located 72 nucleotides upstream of the genomic stop codon. This editing site showed 97% of editing extent and 100% of conservation. In 10 cases, premature stop codons were created, probably as a result of erroneous editing (Schuster and Brennicke 1991). Those sites were only found edited in a single SRA experiment in most cases, and when edited, less than 20% of the reads showed the editing event. For example, a premature stop codon in the gene nad2 of Nicotiana is produced about 560 bases upstream of the genomic stop codon due to an editing site poorly conserved (i.e., only present in one out of 17 angiosperms) and with 10% of editing extent. The generation of premature stop codons may be the result of meaningless editing (Bentolila et al. 2013; Hammani and Giegé 2014; Mower and Palmer 2006; Picardi et al. 2010; Sun et al. 2016), or may have a functional role in plant mitochondria (Schuster and Brennicke 1991).

Conservation of editing sites

We set out to analyze the phylogenetic distribution of RNA editing sites in 24 mitochondrial genes of 17 diverse angiosperms. We investigated to which extent individual mitochondrial RNA editing sites are conserved among plant lineages. The conservation of editing sites across the studied angiosperms was evaluated as the proportion of edited cytidines or genomic thymidines at each position in the gene alignments (ESM 1, ESM 9). For each gene, the average conservation of editing sites was measured separately for each codon position (Fig. 3). The conservation levels varied among codon positions, with average values across all genes of ~ 70, ~ 85, and ~ 30% for first, second, and third codon positions, respectively (dashed lines in Fig. 3). Editing sites at third codon positions were poorly conserved, as expected for synonymous sites with no obvious impact in the resulting protein. Interestingly, cytochrome C biogenesis genes had conservation levels below the average values for first and second codon positions.

Fig. 3
figure 3

Average conservation of editing sites in each mitochondrial gene at first (a), second (b), and third (c) codon positions, respectively. Dashed horizontal lines represent the global average of conservation of editing sites at each codon position

Even though editing sites at first and second codon positions showed a high level of conservation across the studied plant species and for most genes, we expected even greater conservation given the apparent effect of editing at these sites. Looking in detail at the gene alignments (ESM 9), we noticed that the level of conservation is diminished because of species-specific editing sites. For example, a particular plant species was edited at a position in which all other species had a genomic C (i.e. an unedited cytidine), resulting in little editing site conservation at that site and a reduction in the amino acid conservation in that particular codon. For example, Arabidopsis, Gossypium, and Populus had a first-codon-position editing site at positions 382, 379, and 382 of the gene atp4, respectively, while this site had a genomic C in the remaining 14 angiosperms.

Low editing extent at third codon positions

The extent of RNA editing was measured for each editing site in the 24 mitochondrial genes of 17 angiosperms (ESM 10). When grouping the editing sites according to their codon positions, different levels of editing extent were observed (Fig. 4a). Note that the editing extent ranges between 10 and 100% because only editing sites with a minimum editing extent of 10% were considered. The average editing extent was around 80% for first and second codon positions, while it was around 40% for third codon positions (thick red lines in Fig. 4a). These differences across codon positions were also observed in an in-depth analysis of the transcriptome of Arabidopsis (Bentolila et al. 2013) as well as in other studies (Guo et al. 2017; Mower and Palmer 2006; Picardi et al. 2010; Stone et al. 2017; Wu et al. 2015). The average editing extent of sites at third codon positions was significantly different from those at first or second codon positions (one-sample Student’s t-test, p-values equal to 8.6 × 10e−219 and 4.1 × 10e−251 for first and second codon positions, respectively). The lower editing extent at third codon position agrees with a low conservation of editing sites at this codon position across angiosperms (previous section, Fig. 3). This suggests that editing at synonymous positions is not required for the translation of those transcripts, although this partial editing could also be relevant for the regulation of gene expression (Castandet et al. 2010; Stone et al. 2017).

Fig. 4
figure 4

Editing extent of each identified editing site per codon position. a Editing extent at each editing site according to codon positions (CP). Thick red lines show average values of editing extent at each codon position. b Average editing extent for each gene according to the codon position of the editing sites. Blue, green, and red points depict values for first, second, and third codon position, respectively. Asterisks indicate significant differences among the average editing extents at each codon position within each gene. Dashed lines indicate average values from the editing extent across genes for each codon position

The editing extent was also analyzed in each gene individually. Average editing extents were discriminated by codon position (Fig. 4b). The levels observed at first and second codon positions barely differed across all genes, while average values for third codon positions were always smaller and more variable. Interestingly, the editing extent in cytochrome C biogenesis genes at first and second position was much lower than the rest of the genes—as observed previously (Giegé and Brennicke 1999; Zehrmann et al. 2008), in agreement with a lower conservation of editing sites in these genes along the angiosperm phylogeny (Fig. 3). Also matR showed low editing efficiencies for both codon positions as reported for several other angiosperms (Mower and Palmer 2006).

Based on the 8318 editing sites in the 24 protein-coding genes conserved across the 17 mitochondrial genomes, the correlation between the editing extent and conservation level was assessed by means of Pearson’s correlation. As a result, both variables showed a strong positive correlation (rho equal to 0.65 and p-value equal to 0). When this analysis was repeated considering only editing sites in each species or each gene, the correlation estimates fluctuated along a wide range. For example, while Nicotiana showed a very strong correlation (rho equal to 0.82 and p-value 0), Cucumis showed a moderate-weak correlation (rho equal to 0.40 and p-value 0). Similarly, the correlation resulted very strong for cox1 (rho equal to 0.85 and p-value 0), while a weak correlation was detected in ccmB (rho equal to 0.33 and p-value 0).

Thymidine footprints

Editing sites are being mainly lost via silent replacements to thymidines in genomes (Cuenca et al. 2010; Mower 2008; Shields and Wolfe 1997; Sloan et al. 2010). We observed that several gene sequences showed regions containing numerous consecutive thymidines at positions of highly conserved editing sites (“Identification of thymidine footprints” section). These regions were named thymidine footprints and are likely the result of loss of editing sites through retroprocessing (Bowe and dePamphilis 1996; Cuenca et al. 2016, 2010; Hecht et al. 2011; Mulligan et al. 2007; Parkinson et al. 2005; Sloan et al. 2010). For example, the gene cox1 had long thymidine footprints in several angiosperms (Fig. 5a). In Arabidopsis and Populus the thymidine footprints spanned the whole gene as the cox1 gene is devoid of editing sites (ESM 9). Nelumbo, Zea, and Oryza exhibited long footprints (~ 1 kb long) with 24, 28, and 32 consecutive thymidines at positions of highly conserved editing sites, respectively.

Fig. 5
figure 5

a Localization of editing sites in the cox1 alignment. Editing sites at first (blue) or second (green) codon positions are depicted as vertical bars that cross horizontal lines (gene sequences). For positions in which at least one angiosperm contains an editing site, the nucleotides of other angiosperms at this site are marked with blue or green vertical bars for editing sites at first and second codon positions, respectively; with black, gray or yellow vertical bars when a T, a C or a gap is present, respectively; and no marks for A or G bases. Exons are represented by gray boxes of different intensity. b Presence of thymidine footprints in each gene. Thymidine footprints are depicted by colored squares. For each square, the number of thymidines replacing highly conserved editing sites is shown. The color of a square indicates the ratio of the footprint length and the gene/exon length. A color gradient from blue (ratio = 0) to red (ratio = 1) is used. Squares with red borders indicate footprints that span the whole gene/exon (ratio = 1). Exons without thymidine footprints in any plant species are not shown. Squares with black borders indicate intron loss by retroprocessing

A systematic analysis identified more than 150 thymidine footprints on gene alignments encompassing more than a thousand of thymidines at positions of highly conserved editing sites (Fig. 5b). We observed that all protein-coding genes except for ccmFn had at least one angiosperm showing thymidine footprints (Fig. 5b and ESM 9). Strikingly, 16 genes or exons were fully covered by thymidine footprints (cells with red borders in Fig. 5b). It is also noticeable that most footprints were found in eudicots and none in Liriodendron (Fig. 5b). Populus showed the largest number of thymidine footprints, distributed in 19 out of 24 genes. Three of them (atp9, cox1, and cox2) lack editing sites even at third codon positions, while the other 3 genes had 1 or 2 editing sites at third codon positions, which were only poorly conserved across the phylogeny. When analyzing each gene across the phylogeny, we observed that the gene cox1 exhibited thymidine footprints most frequently, followed by nad4, nad5, and nad7.

We also looked at the intron status to assess any possible correlation between the loss of editing sites (thymidine footprints) and nearby introns. We found that the gene nad4 of Helianthus lack editing sites and introns in a ~ 1 kb region that contains introns 2 and 3 in the other asterids (ESM 9). In contrast, Zea and Oryza showed no editing sites in exon 2 but retained all introns in the gene nad4.

Retroprocessing behind the loss of editing sites

Based on the identified RNA editing sites, different evolutionary hypotheses about the gain and loss of these sites were tested. This was carried on by means of non-reversible Markov models based on five nucleotide states (“Estimation of Markov models” section), in which editing sites were the fifth nucleotide state E (Mower 2008). The gain of editing sites was defined as the rate ratio ρ TE at which thymidines are substituted by editable cytidines, whereas a loss of editing was defined as the rate ratio ρ ET at which editable cytidines are substituted by thymidines. The rationale behind these definitions is that the E-to-T and T-to-E substitutions are synonymous at the protein level, and thereby theoretically free of selection (Mower 2008; Shields and Wolfe 1997). If editing sites are being lost during angiosperm evolution, the rate ratio ρ ET is expected to be greater than the rate ratio ρ TE (Cuenca et al. 2010; Mower 2008; Shields and Wolfe 1997; Sloan et al. 2010).

To test this hypothesis, we first evaluated the best way of modeling the nucleotide alignments containing editing sites by using a single Markov model for the three codon positions (M0 in ESM 11A) or three different Markov models for each codon position (M1, M2, and, M3 in ESM 11A for first, second, and third codon positions, respectively). The selection between the two models (M0 vs M1 + M2 + M3) was carried out by comparing their likelihoods with a likelihood-ratio test (LRT) with degree of freedom (df) equal to the difference in their numbers of rate ratios. The LRT indicated that the use of a Markov model for each codon position (M1, M2, and M3 in ESM 11A) significantly increased the fit to the data (df = 40; p-value equals 0). Therefore, we used these models to test the hypothesis that editing sites are lost more rapidly than gained in angiosperm evolution. In the three Markov models (M1, M2, and M3 in ESM 11A), the rate ratio ρ TE was significantly greater than the rate ratio ρ ET (df = 1; p-values of 2 × e−145, 1 × e−246, and 4 × e-−32 for first, second, and third codon position, respectively), indicating that editing sites are being lost faster than gained in the three codon positions. The significance of the comparison between the two rate ratios was assessed by estimating an alternative model for each codon position, in which the rate ratio ρ TE was constrained to be equal to the rate ratio ρ ET (M4, M5, M6 in ESM 11A for first, second, and third codon positions, respectively). The results were significant for all three codon positions.

However, these models assume that editing sites are subjected to the same mutation rate as the other sites and do not consider the effect of retroprocessing, which might be affecting the rate ratio ρ ET in particular (Mulligan et al. 2007; Sloan et al. 2010). To quantify the effect of retroprocessing in each of the 20 rate ratios, 20 new alternative Markov models were estimated for each codon position (M1a−t, M2a−t, and M3a−t in ESM 11B) based on modified alignments, in which all the thymidine footprints along with their homologous sites were removed from the initial alignments. For each codon position, each alternative model was estimated with one of the 20 rate ratios fixed with the value of the same rate ratio estimated in the initial model (M1, M2, or M3). Comparisons of these alternative models with the initial ones using LRTs showed that the rate ratio ρ ET for the first and second codon positions was significantly affected by the presence of thymidine footprints (df = 1; p-values of 3 × 10e−13 and 4 × 10e−23, respectively).

To avoid the confounding effect of retroprocessing, the difference between the editing site gain and loss rates was tested again using the alignment without thymidine footprints. For each codon position, we compared the (null) models (M7, M8, M9 in ESM 11A) with alternative ones, in which ρ TE was equal to ρ ET (M10, M11, and M12 in ESM 11A). Their comparisons revealed that, even with thymidine footprints removed, the rate ratio ρ ET was greater than ρ TE and that the two rate ratios were significantly different to each other for the three codon positions (df = 1; p-values of 4 × 10e−37, 5 × 10e−55, 5 × 10e−10, respectively). Nevertheless, the new rate ratio ρ ET showed a drastic reduction in comparison to that in the first testing that included thymidine footprints. For example, the new rate ratio ρ ET has almost reduced by half in comparison to the values obtained originally (from 32.5 to 18.5). Even though thymidine footprints were removed, it is possible that the rate ratio ρ ET may still be affected by retroprocessing owing to the large number of more lax thymidine footprints (i.e., those with less than 4 consecutive thymidines) observed in the alignments. The number of thymidine footprints containing 3 thymidines, for example, doubled in number those containing four or more thymidines. We were unable to compare the rate ratios ρ TE and ρ ET for alignments without footprints containing 3 consecutive thymidines because they were almost devoid of editing sites, preventing us from estimating further Markov models.

We also studied the evolution of editing sites by analyzing the rate ratios ρCE and ρ EC (Mower 2008). These rate ratios indicate the gain and loss of editing site recognition motifs, respectively (Cummings and Myers 2004; Farré et al. 2001; Giegé and Brennicke 1999; Mower 2008; Mulligan et al. 2007; Neuwirt et al. 2005). Using the Markov models M1, M2, and M3 as the null models, alternative models (M13, M14, M15 in ESM 11A) were estimated from the initial alignments for each codon position by constraining the rate ratios ρCE and ρ EC to be equal. The results showed that the rate ratio ρCE was not significantly different to the rate ratio ρ EC for first and second codon positions (ESM 11A). These findings indicate that editing recognition motifs are lost and gained at the same rate for editing sites at first and second codon positions.

Comparing the rate ratios estimated by the Markov models, the rates of editing site loss (ρ ET ) were greater than the loss rates of editing site recognition motif (ρ EC ) for first and second codon positions based on the alignments with or without thymidines (ESM 11A). In addition, the rates of gain of editing site recognition motif (ρCE) were much larger than the rates of editing site gain (ρ TE ) for all codon positions based on the alignments with or without thymidines (ESM 11A).

Understanding editing site gains

To learn more about the gain of editing sites during the evolution of plant mitochondria, we performed ancestral state reconstruction at each node of the angiosperm phylogeny. A total of 708 transitions to “Edited” were estimated along the evolution of the 24 mitochondrial protein-coding genes in the 17 selected angiosperms (ESM 12, Table 1). In agreement with the elevated rate ratio for C to E described above, most of the gained editing sites (92%) arose from an ancestral C. In this case, ancestral cytidines would be, later in evolution, recognized by the editosome and edited to thymidines. About 5% of the new editing sites arose from an ancestral T, in which case the edited transcript ended up unchanged through the acquisition of the edited C. Editing sites gained from ancestral A or G were the least represented (~ 3 and ~ 1% respectively). These rare gains involved either synonymous or non-synonymous editing at sites with poor amino acid conservation among angiosperms.

Table 1 Features of the different types of editing site gains inferred through ancestral state reconstruction

The gains of editing sites were further analyzed respect to their level of conservation and the effect of editing in the translated protein. The majority of the C-to-E gains (72%) were at positions in which editing was synonymous, mainly found at third codon positions. Accordingly, the gained editing sites from an ancestral C showed low editing extents and low levels of conservation (Table 1). For example, Phoenix nad3 gained a third-codon position editing site (CDS position 225 in ESM1) with an editing extent and conservation of 12 and 6%, respectively. In contrast, most T-to-E gains (76%) represented non-synonymous editing that resulted in an evolutionarily conserved amino acid. These gains were in general edited very efficiently and those sites were extremely conserved across the angiosperm phylogeny (Table 1). Finally, we examined the nucleotides flanking the gained editing sites for the presence of conserved motifs involved in editing site recognition. Previous studies observed that editing sites in plant mitochondria are often preceded by pyrimidines and in some cases followed by purines (Cummings and Myers 2004; Giegé and Brennicke 1999; Mulligan et al. 2007). We found that 86% of the gained editing sites had pyrimidines preceding the editing site, while 42% showed the full YCR motif. These values are slightly lower than those for all the identified editing sites, in which, 93 and 55% exhibited the YCN and YCR motif, respectively. Interestingly, gained editing sites at second codon positions showed the lowest frequency of the YCN motif (67%), while 86 and 90% of those at first and third codon positions had the YCN motif.

Discussion

To achieve a comprehensive picture of mitochondrial C-to-U RNA editing sites, 17 mitochondrial genomes from diverse angiosperms were analyzed. By comparing the mitochondrial protein-coding genes with transcriptomes available as RNA-seq data in public databases, 10,217 editing sites, or C-to-U changes, were identified. Although other types of nucleotide mismatches were also found, these unconventional mismatches represented less than 7% in comparison to the number of C-to-U mismatches. Many of the unconventional mismatches were the result of errors in the annotation of splicing sites, or due to the dual transcription of mitochondrial and nuclear copies of mitochondrial genes. No strong evidence was found to support non-canonical RNA editing in the studied angiosperm mitochondria.

For the comparative and evolutionary analyses, we focused on 8318 editing sites located in the 24 protein-coding genes that are present in all 17 angiosperm mitochondrial genomes. C-to-U RNA editing sites were unevenly distributed across the 17 plant species. Mitochondrial genes in the magnoliid Liriodendron and the basal eudicot Nelumbo were the most heavily edited in stark contrast to those in rosids. Also, mitochondrial protein-coding genes showed different frequency of C-to-U mismatches. Genes involved in cytochrome C biogenesis and in the respiratory complex I exhibited the highest editing densities, while ribosomal subunits and respiratory complex V genes showed the lowest editing frequencies. Although some protein-coding genes had consistent numbers of editing sites across the phylogeny, others showed differences (e.g., atp9, nad3, and nad4). These results contrast with previous observations based on a limited number of taxa, in which the frequency of editing sites across species appeared to be fairly consistent and homogeneous (Mower and Palmer 2006). In agreement with other studies (Cuenca et al. 2010; Giegé and Brennicke 1999; Mulligan et al. 2007; Sloan et al. 2010; Picardi et al. 2010; Yura and Go 2008), the identified C-to-U RNA editing sites were unevenly distributed among the three codon positions, being preferably located at second followed by first codon positions.

RNA editing at first and second codon positions is mostly non-synonymous and tends to increase amino acid conservation. Editing sites at second codon positions, and to a lesser extent at first codon positions, were highly conserved across the studied plant species and were efficiently edited. Moreover, non-synonymous editing sites led to amino acid changes that increased the level of hydrophobicity in protein products, which is consistent with previous observations (Cuenca et al. 2010; Covello and Gray 1993; Giegé and Brennicke 1999; Jobson and Qiu 2008). The high conservation and editing efficiency we observed at editing sites at first and second codon positions highlight the essential role of non-synonymous editing sites for the proper function of mitochondrial-encoded proteins. All together, these results suggest that natural selection seems to be maintaining non-synonymous editing in mitochondrial genomes (Cuenca et al. 2010; Jobson and Qiu 2008; Mower 2008).

In contrast to first and second codon positions, editing sites at third codon positions were relatively few (12%) and led to synonymous codon changes. They were poorly conserved and were only partially edited exhibiting the lowest editing extents (40% in average). These observations suggest that they are not directly maintained by natural selection, as observed previously (Bentolila et al. 2008, 2013; Guo et al. 2017; Mower and Palmer 2006; Picardi et al. 2010; Stone et al. 2017; Wu et al. 2015; Zehrmann et al. 2008; Sloan et al. 2010). Furthermore, it has been shown that the editing extent at synonymous positions varies more often among individuals, tissue types or developmental stages than at non-synonymous sites (Bentolila et al. 2008; Zehrmann et al. 2008; Sloan et al. 2010). So, why are third codon editing sites edited? They are likely edited accidentally by the RNA editing machinery or editosome (Bentolila et al. 2013; Hammani and Giegé 2014; Mower and Palmer 2006; Picardi et al. 2010; Sun et al. 2016). To perform editing, the editosome specifically binds messenger RNAs through molecular signals flanking editing sites (Cummings and Myers 2004; Farré et al. 2001; Giegé and Brennicke 1999; Mulligan et al. 2007; Neuwirt et al. 2005). Third codon position editing sites could be surrounded by degenerated signals which are partially recognized by the editosome (Bentolila et al. 2013; Picardi et al. 2010; Takenaka et al. 2013). Alternatively, synonymous edits could be playing different roles in the expression of mitochondrial genes (Sloan and Taylor 2010). Previous research has shown that, in some cases, such editing sites affect the formation and stability of secondary structures in transcripts (Kindgren et al. 2015; Stone et al. 2017), which is crucial to ensure splicing in some genes (Castandet et al. 2010). Our data agree with these observations, showing 23 editing sites at third codon positions with conservation and editing extent greater than 90%. For example, the editing site at position 399 in Daucus ccmC showed an editing extent of 92% and a conservation of 94%. A detailed examination revealed the presence of other editing sites within a 3 nucleotide window surrounding 7 of the 23 editing sites, suggesting that their successful editing could be required for the RNA editing of the nearby, non-synonymous, editing sites (Kindgren et al. 2015). .

Our results showed that early-diverging lineages had a significant tendency towards having more editing sites in their mitochondrial genes than late-diverging ones, as observed in other studies (Cuenca et al. 2010; Mower 2008; Sloan et al. 2010). We found that the loss of editing sites along angiosperm evolution is mainly occurring by replacing editing sites with thymidines, instead of a degradation of the editing recognition motif around editing sites. The loss rates of editing sites (E to T) was greater than the loss of editing site recognition (E to C) in our evolutionary analyses. This was expected because the replacement of editing sites by genomic thymidines does not alter the encoded amino acid. This replacement could take place by C-to-T point-mutations or by retroprocessing (Mower 2008; Shields and Wolfe 1997; Sloan et al. 2010; Grewe et al. 2010; Parkinson et al. 2005; Ran et al. 2010; Grewe et al. 2014). Under the model of retroprocessing, the reverse transcription of edited mRNAs produces cDNA molecules, in which editing sites are replaced by thymidines. When these cDNAs undergo homologous recombination with the mitochondrial genome, consecutive adjacent editing sites are lost simultaneously. Although in vivo evidence is yet lacking in angiosperm mitochondria, abundant indirect evidence of the acting of retroprocessing has accumulated (Bégu et al. 1998; Grewe et al. 2010; Farré and Araya 1999; Odom and Herrin 2013; Ran et al. 2010; Wahleithner et al. 1990). The role of retroprocessing is apparent when long sequence stretches lost consecutive editing sites (thymidine footprints) and even introns (Cuenca et al. 2010, 2016; Grewe et al. 2010; Ran et al. 2010).

A systematic analysis revealed the presence of a number of thymidine footprints in mitochondrial genes with a minimum of four consecutive thymidines in positions of highly conserved editing sites. If retroprocessing was indeed responsible for the thymidine footprints, our results indicate that retroprocessing is a widespread phenomenon in rosids, moderate in monocots, and almost absent in the magnoliid Liriodendron and the basal eudicot Nelumbo. The spatial localization of the identified thymidine footprints on genes revealed that, even though in a few cases all editing sites were lost from a gene (e.g. cox1), retroprocessing seemed to act locally, eliminating editing sites from different internal regions of a gene (Cuenca et al. 2016; Geiss et al. 1994; Grewe et al. 2009; Itchoda et al. 2002; Lopez et al. 2007; Ran et al. 2010; Sloan et al. 2010). Different models have been proposed to explain this localized effect of retroprocessing, each entailing different predictions regarding the frequency and location of intron and editing site losses along genes (Cuenca et al. 2016; Zhang et al. 2010). One is the reverse transcription of variable-length cDNA molecules from completely or partially edited transcripts. Another is the presence of full-length cDNA molecules, which recombine only partially. Recent evidence suggests that microconversion could be responsible for the localized loss of edited sites (Cuenca et al. 2010; Hao and Palmer 2009). We agree with previous studies that retroprocessing is an important mechanism behind editing and intron loss in mitochondrial genes of plants (Cuenca et al. 2016; Mower 2008; Sloan et al. 2010). Here, we identified the loss of introns in the gene nad4 of Helianthus and did not detect a bias in the location of thymidine footprints on genes. Our observations agree with a retroprocessing model, in which reverse-transcribed transcripts recombine with the genome but gene conversion takes place most frequently in interior regions of the genes rather than at the ends (Zhang et al. 2010). Previous reports showed that the frequency of recombination varies along genomic regions and that some genes or gene regions may recombine preferentially (Lawson et al. 2010; Paigen and Petkov 2010). Similarly, we observed a preferential loss of editing sites in the gene cox1 across the angiosperm phylogeny.

To further evaluate the retroprocessing role in editing site losses, statistical tests compared evolutionary hypotheses based on data sets including and excluding the thymidine footprints. These tests showed a significant decrease in the rate ratio E-to-T for first and second codon positions when thymidine footprints were excluded, while all other nucleotide changes remained the same. These results provide evidence that retroprocessing is, at least partly, responsible for the E to T changes. If retroprocessing indeed played a major role in the loss of editing sites, then the role of natural selection in purging RNA editing sites through independent point mutations during angiosperm mitochondrial evolution (Mower 2008) should be reconsidered. Nevertheless, further work will be necessary to gather additional evidence to test the hypothesis of loss of editing sites by retroprocessing.

The gain and loss rates of editing site recognition (C to E and E to C, respectively) were also estimated and compared. Our analyses showed that the rates of acquisition and degradation of editing site recognition motifs were relatively elevated and it was 5–15 times higher in the third than in first and second codon positions. These observations agree with a previous study (Mower 2008) and may relate to the dynamic nature of the editing process and relaxed specificity of the editing machinery (Bentolila et al. 2013; Takenaka et al. 2012). Many of those sites that are variably recognized by the editosome along angiosperm evolution may not represent real acquisitions or degradation of editing motifs. Instead, they may be the result of the inconsistency of the editing process within species and tissues, particularly at synonymous sites. This hypothesis is supported by the fact that the rate ratios C-to-E and E-to-C are greater in editing sites at third codon positions, which are synonymous and show overall a low editing extent and are poorly conserved.

By ancestral state reconstruction, we examined the gain of editing sites and found that the majority of the gains arose from ancestral Cs and in a much lesser extent from ancestral Ts. This result was unexpected because a C to E gain results in a different nucleotide in the transcript, which may lead to a change in protein sequence. In contrast a T to E gain results in an identical transcript and protein sequence. A more detailed analysis revealed that the gains from ancestral Cs were mostly synonymous (72%). Also, these C to E gains showed low average conservation (24%) and low average editing extent (37%). These observations indicate that cytidines that are not under strong natural selection are being edited erratically, as previously observed for editing at third codon positions. It is also possible that putative gains of editing recognition in C to E gains may be an artifact resulting from the natural variability of editing efficiency at those sites with predominant synonymous editing. Then, if cytidines are poorly edited (< 10% editing extent; i.e. not considered editing sites) in several species while a C at the same alignment position shows an editing extent bearly greater than 10% in a single species, it will appear as a C-to-E gain. For example, one site of the gene cox3 (alignment position 603) was identified as an editing site in Glycine and Populus with 12–18% of editing extent, while it was considered a non-edited C in 11 angiosperms that showed 1–6% of editing extent. The opposite is true for T to E gains, in which most of the editing is non-synonymous (76%) and these editing sites are efficiently edited (75%) and highly conserved (97%).

Materials and methods

Identification of RNA editing sites from transcriptomic data

To identify editing sites, a custom computational pipeline was coded to take two inputs for each plant species: a reference file with mitochondrial protein-coding genes and a file with transcriptomic data. The reference file consisted in a multi-FASTA file containing the CDS of protein-coding genes. These sequences were extracted from complete mitochondrial genomes available in the GenBank organellar genome database (Accession Numbers in ESM 13). To increase the read coverage at the end of the CDS, their 5′ and 3′ ends were extended with 100-bp flanking regions, which were discarded after read mapping. The transcriptomic data consisted in a SAM file, which contained reads aligned on the reference gene. These reads were obtained from RNA-seq data in the SRA database at NCBI (SRA accessions in ESM 13). Based on these two inputs, editing sites were identified by counting the number of A, C, G, and U nucleotides in reads aligned at each position of the reference gene. The editing extent, #U/(#A+#C+#G+#U), was calculated at each C-containing reference position. If a C-containing reference position showed an editing extent equal to or greater than 10%, it was defined as an editing site. The same procedure was used to identify unconventional mismatches between mitochondrial genes and transcriptomes. The threshold of 10% was chosen because it led to a high number of editing sites conserved across all the studied plant species (ESM 14).

The SRA accessions were chosen to match the same plant species of the reference gene; however, different individuals or even different ecotypes had to be used. For the genus Liriodendron, of particular interest given its evolutionary position, different species were used because a single mitochondrial genome (from L. tulipifera) and a single transcriptomic study (from L. chinense) were available. We expected a low number of disagreements between these species due to the extremely low mutation rate in the mitochondrial DNA of Liriodendron (Richardson et al. 2013). Also, it was expected that C to T polymorphisms would show 100% Ts in the mapped reads. Only 6 out of 991 editing sites identified in Liriodendron showed editing extent equal to 100% (ESM1), and those were highly conserved editing sites across the angiosperm phylogeny. In addition, the identification of about 90% of the editing sites at first and second codon positions in Liriodendron was supported by predictions made by PREP-Mt (Mower 2009) under a cutoff of 0.2. Thus, the putative misidentification of editing sites in Liriodendron due to the use of two different species is negligible.

To reliably identify editing sites, only SRA accessions with paired-end reads were selected, which increased the specificity of the read mapping, reducing the number of false positives. Whenever it was possible, RNA-seq data obtained from leaves or roots of wild type individuals were preferred, avoiding those from mutant or virus-infected organisms. About 70% of the 99 SRA accessions were from root or leaf tissues (ESM 13). Because the SRA database included a variable amount of SRA accessions for each of the 17 selected angiosperm species, a number of SRA accessions was chosen so that 80% of the C-containing positions in each mitochondrial gene had at least 10 aligned reads. To minimize the number of false negatives in editing site identification, any C-containing position with a read depth < 10× was annotated as missing data (a gap) in the gene sequences. If a single SRA accession met the listed criteria, an additional SRA accession was selected from a different individual to reduce any potential bias in the transcriptome. On average, ~ 6 SRA accessions per species were chosen, resulting in a total of 99 SRA accessions (ESM13). Considering all the selected SRA accessions, the average read coverage was calculated for all the C-containing positions in a gene in each species (ESM 15). The minimum average read coverage was ~ 12 reads for Glycine nad6, whereas the maximum read coverage was of ~ 900 reads for Gossypium rps12.

The SAM file of each gene was generated using Magic-BLAST v1.1.0 (https://ncbi.github.io/magicblast). Magic-BLAST was used instead of other read aligners because it runs remotely on NCBI servers. Thus, SAM files were created without downloading RNA-seq files into local computers. To guarantee mapping reads from mature mRNAs, avoiding partially edited transcripts, penalty of − 4, gap open of 50, and gap extend of 50 were set. To discard unreliable mapped reads, duplicated and unpaired reads were filtered out from each SAM file using SAMtools v0.1Footnote 4 When more than one SAM file was associated to a reference file, they were merged together into a single SAM file with Picard tools v2.9.0Footnote 5. Then, each SAM file was converted into a BAM file and sorted with SAMtools. With a sorted BAM file, bam-readcount v0.8.0 was employed to count the number of A, C, G, and U nucleotides at each position.

Amino acid changes produced by RNA editing

To analyze amino acid changes produced by editing sites, codons containing editing sites were extracted from each protein-coding gene sequence of each studied plant species. Each codon with more than one editing sites was copied as many times as its number of editing sites. Editing sites in codon copies were substituted by unedited cytidines so that each copy had a single editing site from the original codon. Codons before and after editing were translated into amino acids according to the standard genetic code and the resulting amino acids of both groups were compared to identify physicochemical changes.

Measuring conservation of editing sites

Mitochondrial protein-coding genes were automatically aligned with Mafft v7.305 (Katoh and Standley 2013), manually adjusted with AliView v1.18 (Larsson 2014), and their unalignable 5′ and 3′ ends were trimmed (ESM3). Given a gene alignment, a position (column) was eliminated unless one plant species contained an editing site at that position. The conservation of each editing site was measured by calculating a ratio between the number of editing sites or genomic thymidines (because in both cases the translated amino acid would be the same), and the total number of nucleotides in that column.

Identification of thymidine footprints

Thymidine footprints were identified from each gene alignment considering only the editing sites at first or second codon positions. Given a gene alignment, candidate regions were screened from each gene sequence. A candidate region was any region between two consecutive editing sites. For each candidate region, the number of thymidines at positions in which other species had an editing site was counted. Those regions containing > 3 thymidines at positions of highly conserved editing sites were defined as thymidine footprints.

Estimation of markov models

All nucleotide alignments were concatenated into a single alignment and the editing sites were annotated as a fifth nucleotide state E (Mower 2008). From the concatenated alignment, three additional alignments were obtained by partitioning positions based on their codon positions. The gain and loss of editing sites can be modeled by non-reversible Markov models (Mower 2008). Given five nucleotide states (A, C, G, T, and E), a Markov model represents the probability to go from any nucleotide state i to any other nucleotide state j over an infinitesimal period of time t. Formally, it defines a probability distribution P(t) = exp(Qt), which is entirely determined by the 5 × 5 matrix Q called the instantaneous rate matrix. Each cell, or instantaneous rate, is the product of two parameters: ρ ij π j , the rate ratio ρ ij , the rate at which state i is substituted by state j, and π j , the frequency at which state j occurs. The alignments were used to estimate instantaneous rate matrices with BayesTraits v3.0Footnote 6. An MLTries equal to 100 and ScaleTrees equal to 1 were used. For all the Markov models that were estimated, the angiosperm phylogeny (Fig. 1) was used as tree-based hypothesis, whose branch lengths were optimized by IQ-TREE (Nguyen et al. 2014) with an UNREST evolutionary model using the concatenated nucleotide alignment with editing sites annotated as cytidines.

The ancestral states of RNA editing at each node of the angiosperm phylogeny were estimated with the method “Most Recent Common Ancestor” implemented in Bayestraits v3.0.1 using a Bayesian Markov-chain Monte Carlo (MCMC) approach. A Bayes Multistate model with five possible states (A, C, G, T, and E) was conducted for 150,000 iterations, with the first 50,000 discarded as burn-in, and sampled every 500 cycles over the topology of our phylogeny. A hyperprior with a gamma prior (seeded from a uniform distribution on the interval 0–100) was used to reduce uncertainty and arbitrariness of choosing priors in the MCMC analysis. The mean value of the posterior probabilities for each site was used to reconstruct the ancestral states. The estimated gains of editing sites along the evolution were obtained and further analyzed (ESM 12).