Introduction

The Malpighiales are one of the largest and most diverse orders of flowering plants, containing about 8% of all eudicots and 6% of all angiosperms (Davis et al. 2005). In an expanded circumscription the order currently comprises 38 families (APG 2003; Barkman et al. 2004) and nearly 16,000 species (information taken from the Angiosperm Phylogeny Website, Stevens 2001 onwards). The order contains some well known families, such as Euphorbiaceae (spurges), Passifloraceae (passion fruits), Linaceae (flaxes), Salicaceae (poplars and willows), and Violaceae (violets). Many of the families are distributed in the tropics where they constitute an important element of the understory of tropical rain forests (Davis et al. 2005).

The first molecular study across angiosperms based on sequences of the plastid gene rbcL (Chase et al. 1993) already depicted a lineage of Chrysobalanaceae, Erythroxylaceae, Violaceae, Ochnaceae, Euphorbiaceae, Humiriaceae, Passifloraceae and Malpighiaceae within a rosid clade. Close relationships of these families had not been considered in pre-cladistic classification systems, e.g., those of Cronquist (1981) or Takhtajan (1997). The addition of morphological characters to the rbcL matrix (Nandi et al. 1998) also recovered this new clade and suggested morphological features such as a fibrous exotegmen, dry stigmas, trilacunar nodes and toothed leaf margin as possible synapomorphies. Subsequent analyses combining rbcL and atpB (Savolainen et al. 2000a) or rbcL, atpB and 18S rDNA sequences (Soltis et al. 2000) yielded 92% bootstrap (BS) and 100% jackknife (JK) support for the Malpighiales, respectively. The highest support from a single gene was obtained by the phylogenetic analysis of angiosperms of Hilu et al. (2003) based on partial sequences of the rapidly evolving plastid gene matK.

Some major clades within Malpighiales have been identified so far, e.g., a clade uniting Elatinaceae and Malpighiaceae (Davis and Chase 2004), the clade of Ochnaceae, Quiinaceae and Medusagynaceae (Fay et al. 1997) or the grouping of Clusiaceae, Hypericaceae, Bonnetiaceae and Podostemaceae (Davis et al. 2005). Some of these families were merged into broadly defined families by APG II (2003), as for example Ochnaceae s.l. (including Medusagynaceae and Quiinaceae). Other families such as Flacourtiaceae were split up and partly transferred to Salicaceae, a family that now contains about 1,000 species (Chase et al. 2002). Euphorbiaceae s.l. are now viewed as several independent lineages (Euphorbiaceae, Phyllanthaceae, Picrodendraceae and Putranjivaceae; APG 1998; Savolainen et al. 2000b; Wurdack et al. 2005).

But even with large sets of data and the use of three (Soltis et al. 2000) or four genes (Davis et al. 2005) from all three plant genomes, the phylogeny of Malpighiales could not be resolved. The most recent study on Malpighiales (Tokuoka and Tobe 2006) combined sequences of rbcL, atpB, 18S rDNA, and matK and yielded the best phylogenetic hypotheses of Malpighiales so far. Nevertheless, Malpighiales still remain the phylogenetically least understood angiosperm order.

Davis et al. (2005) provide evidence that the diversity in Malpighiales is the result of a rapid radiation that began in tropical rain forests in the late Aptian (114 mya), and that most lineages began to diversify shortly thereafter, with the Hypericaceae–Podostemaceae clade appearing as the youngest during the Campanian (76 mya). A relatively fast diversification into major lineages may serve as an explanation for the difficulty of resolving deep nodes in Malpighiales. Finding sequence characters that have changed at a sufficiently high rate to accumulate mutations between fast lineage branching events, and at the same time have not changed so fast that phylogenetic signal was obscured, appears as a solution. Introns are a promising tool since they are mosaics of conserved and variable elements and provide a greater range of variable sites evolving under different constraints (Kelchner 2002). Group II introns with their overall conserved secondary and tertiary structure and well characterized domains are especially suited for studying phylogenetic information content with respect to structure, function and molecular evolution of genomic regions.

The effectiveness of rapidly evolving and non-coding chloroplast regions as markers for deep nodes in angiosperms has already been demonstrated. For basal angiosperms, Borsch et al. (2003) sequenced the trnT–F region from the chloroplast genome consisting of two spacers and a group I intron, and Löhne et al. (2005) generated a dataset of sequences of the petD group II intron and the petB–petD spacer. The resulting trees in both studies were highly resolved, well supported and congruent with the multigene and multigenome studies that comprised many times more sequenced nucleotides (Qiu et al. 2000; Zanis et al. 2002). Combined analyses of the rapidly evolving chloroplast regions matK, trnT–F, and petD for early branching angiosperms (Borsch et al. 2005) and for early branching eudicots (Worberg et al. 2007) showed that confidence in phylogenetic hypotheses can still be improved by including more sequence data from introns and spacers. Using a character resampling and statistical analysis pipeline, Müller et al. (2006) have shown that the number of informative sites as well as the phylogenetic signal per informative character is higher in matK and trnT–F than in the slowly evolving rbcL.

This study is part of an ongoing project to evaluate mutational dynamics of rapidly evolving and non-coding chloroplast DNA and their phylogenetic utility in eudicots. Aims of this study were first to generate a dataset of sequences of the petB–petD region for a representative taxon set of Malpighiales, and second to examine their alignability and potential for inferring relationships in a difficult to resolve clade. The third major aim was to evaluate the effects of microstructural mutations on the evolution of the different intron domains.

Materials and methods

Taxon sampling

The data set comprises 64 taxa from Malpighiales and eight representatives from Celastrales and Oxalidales as outgroup. All families of the order recognized by APG II (2003) are included except Bonnetiaceae, Euphroniaceae, Goupiaceae, Lophopyxidaceae and Putranjivaceae for which no material was available. For large families such as Euphorbiaceae or Salicaceae we selected representatives of major clades as retrieved in published phylogenetic analyses of these families. Most of the plants sampled were obtained from the living collection at the Botanical Gardens Bonn. A list of all sampled taxa, their origin and voucher information is given in Table 1.

Table 1 Taxa used in this study, origin of the plant material, voucher information, herbarium acronyms, and GenBank accession numbers

Isolation of genomic DNA

Genomic DNA was isolated from silica-dried leaves or herbarium specimens following the modified CTAB extraction method with triple extractions described by Borsch et al. (2003). Fresh leaves were generally dried in silica gel before extraction. Dry tissue was ground to a fine powder using a mechanical homogenizer (Retsch MM200) with 5 mm beads at 30 Hz for 2 min. DNA from Malesherbia ardens, Dichapetalum mossambicense, Chrysobalanus icaco, Picrodendron baccatum, Touroulia guianensis, Quiina integrifolia, Bergia suffruticosa, Ctenolophon englerianus, Phyllocosmus lemaireanus, and Microdesmis puberula was isolated using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany).

Amplification and sequencing

The amplified fragment consisted of the petB–petD intergenic spacer, the petD-5′-exon and the petD intron. For practical reasons the petB–petD spacer was co-amplified using the universal forward primer pipetB1411F and the reverse primer pipetD738R designed by Löhne and Borsch (2005). Additional internal sequencing primers (OpetD897R: 5′-RATCCCTTSTTTCACTCCGATAG-3′; LIpetD878R: 5′-TGTAGTCATTTCCTCTGCATCGAC-3′; LAMpetD951R: 5′-CATACAAAGRATTTACTTGTTAC-3′; and SALpetD599F: 5′-GCAGGCTCCGTAAAATCCAGTA-3′) were designed in this study for specific groups of taxa because pherograms were not readable downstream of long mononucleotide stretches.
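
Two of the internal primers contain IUPAC ambiguity codes (R = A or G, S = C or G) and therefore stand for a small pool of oligonucleotides. The following minimal Python sketch expands each primer into the unambiguous sequences it encodes. The function name and the dictionary layout are illustrative assumptions and not part of the original workflow; only the primer sequences are taken from the text above.

```python
from itertools import product

# Subset of the standard IUPAC nucleotide codes (enough for the primers below).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

# Primer sequences (5'->3') as given in the text.
PRIMERS = {
    "OpetD897R": "RATCCCTTSTTTCACTCCGATAG",
    "LIpetD878R": "TGTAGTCATTTCCTCTGCATCGAC",
    "LAMpetD951R": "CATACAAAGRATTTACTTGTTAC",
    "SALpetD599F": "GCAGGCTCCGTAAAATCCAGTA",
}

def expand_primer(primer):
    """Return every unambiguous sequence encoded by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]

for name, seq in PRIMERS.items():
    variants = expand_primer(seq)
    print(f"{name}: {len(seq)} nt, {len(variants)} variant(s)")
```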

PCR conditions followed Löhne and Borsch (2005). Reactions were performed in a T3 thermocycler (Biometra, Göttingen, Germany). In some cases where DNA had been isolated from herbarium specimens the universal primers were used in combination with the internal primers OpetD897R and SALpetD599F to amplify the petD region in two overlapping halves. Fragments were visualized using the Flu-o-blu system (Biozym, Hamburg, Germany) and excised from the gel. The DNA was then purified using the QIAquick Gel Extraction Kit (Qiagen, Hilden, Germany) according to the manufacturer’s protocol. PCR products were directly sequenced using the DCTS Quick Start Kit (Beckman Coulter). The reaction mix contained 3 μl DCTS Quick Start Kit (Beckman Coulter), 0.5 μl primer (20 pm/μl), 0.5–6.5 μl DNA template and ultrapure water to obtain a total volume of 10 μl. The cycle sequencing temperature profile consisted of 30 cycles of 96°C for 0:20 min, 50°C for 0:20 min, and 60°C for 4:00 min, on a T3 thermocycler (Biometra, Göttingen, Germany). Samples were run on an automated capillary sequencer (CEQ 8000 Genetic Analysis System, Beckman Coulter). Pherograms were edited using the software PhyDE v. 0.97 (http://www.phyde.de).

Sequence alignment

Chloroplast introns and spacers exhibit a high number of microstructural mutations apart from substitutions. For correct primary homology assessment, the respective mutational events need to be identified and gaps have to be placed accordingly (e.g., Kelchner 2000). The main alignment principle was therefore to search for sequence motifs, not overall sequence similarity. Sequences were aligned manually, using the alignment editor PhyDE v. 0.97 (http://www.phyde.de). The rules for manual alignment of non-coding chloroplast regions proposed by Löhne and Borsch (2005) were also followed here. Single-base indels that were identified during alignment were checked in the original pherograms to make sure that they were not reading errors. Mutational hotspots with uncertain homology assessment (Borsch et al. 2003) were excluded from phylogenetic analysis. The alignment is available from the corresponding author on request.

Sequence statistics and coding of length mutational events

The length ranges of the spacer and the structural partitions of the intron as well as GC content, transition/transversion ratio, and the number of informative and variable positions were calculated using SeqState v. 1.25 (Müller 2005b). Length mutations were coded according to the Simple Indel Coding method (Simmons and Ochoterena 2000) using the Indel Coder option in SeqState v. 1.25 and analysed in combination with the sequence data matrix.
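
The principle of Simple Indel Coding can be illustrated with a short Python sketch: every gap with a distinct pair of start and end positions becomes one binary character, and a sequence in which a longer gap spans that character is scored as inapplicable. This is a simplified illustration on a made-up four-taxon alignment and not the SeqState implementation; the function names are hypothetical, and the special treatment that terminal gaps normally need is deliberately left out.

```python
def gaps(seq):
    """Return the (start, end) spans of all gap runs in an aligned sequence."""
    spans, start = [], None
    for i, ch in enumerate(seq + "N"):        # sentinel closes a trailing gap
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            spans.append((start, i - 1))
            start = None
    return spans

def simple_indel_coding(alignment):
    """Code each distinct gap span as one character.

    1 = the sequence has exactly this gap, 0 = it does not,
    ? = inapplicable because a longer gap in the sequence covers this span.
    """
    per_seq = {name: gaps(seq) for name, seq in alignment.items()}
    chars = sorted({span for spans in per_seq.values() for span in spans})
    matrix = {}
    for name, spans in per_seq.items():
        row = []
        for s, e in chars:
            if (s, e) in spans:
                row.append("1")
            elif any(gs <= s and ge >= e for gs, ge in spans):
                row.append("?")
            else:
                row.append("0")
        matrix[name] = "".join(row)
    return chars, matrix

# Made-up toy alignment (not data from this study).
toy = {
    "taxonA": "ACGTTTGACA",
    "taxonB": "ACG--TGACA",
    "taxonC": "AC----GACA",
    "taxonD": "ACG--TGACA",
}
chars, matrix = simple_indel_coding(toy)
for name, row in matrix.items():
    print(name, row)
```

In the toy alignment the 2-nt gap shared by taxonB and taxonD is one character, and the 4-nt gap of taxonC is a second character; taxonC is scored as inapplicable for the shorter gap because its own gap completely covers it.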

Phylogenetic analysis

Parsimony tree search

All aligned positions were given equal weight and gaps were treated as missing data. The search for the shortest tree was performed with the parsimony ratchet approach as implemented in the software PRAP (Müller 2004). Ratchet settings for this study were 200 iterations with 25% of the positions randomly upweighted (weight = 2) during each replicate and 10 random addition cycles. The matrix was run using only substitution information and then combined with the indel matrix. The number of steps for each tree and the consistency, retention, and rescaled consistency indices (CI, RI, and RC) were calculated by PAUP* v. 4.0b10 (Swofford 1998). Jackknifing was used to evaluate branch support. Jackknife parameters were chosen according to the optimal evaluation strategies described by Müller (2005a). A total number of 10,000 jackknife replicates was performed using the TBR branch swapping algorithm with 36.788% of characters deleted in each replicate. One tree was held during each replicate.
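
The deletion percentage of 36.788% corresponds to e^-1. The following Python sketch shows how one jackknife replicate could draw its character set under that setting. It assumes that each character is deleted independently with probability e^-1, which is a simplification for illustration and not a description of PAUP* internals; the function name and the random seed are arbitrary.

```python
import math
import random

DELETION_FRACTION = math.exp(-1)    # about 0.36788, i.e. the 36.788% quoted above

def jackknife_columns(n_chars, rng):
    """Return the indices of the characters kept in one jackknife replicate.

    Assumption: every character is dropped independently with probability e^-1.
    """
    return [i for i in range(n_chars) if rng.random() >= DELETION_FRACTION]

rng = random.Random(1)                # fixed seed, only to make the sketch repeatable
kept = jackknife_columns(1548, rng)   # 1,548 characters in the final matrix
print(f"deletion fraction: {100 * DELETION_FRACTION:.3f}%")
print(f"characters kept in this replicate: {len(kept)} of 1548")
```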

Bayesian Inference

Bayesian Inference (BI) was performed using MrBayes 3.1 (Huelsenbeck and Ronquist 2001). Nucleotide substitution models for the dataset were evaluated using Modeltest 3.7 (Posada and Crandall 1998) with spacer and intron sequences analysed separately. The hierarchical likelihood ratio test (hLRT) suggested the GTR + I + Γ model as the best for both regions and, therefore, Bayesian analysis was run with the implementation of this model. Two separate BI analyses were run: one only with sequence data and another using sequence data combined with the indel matrix. For the latter, the dataset was partitioned into DNA and binary characters, the GTR + I + Γ model was employed for the sequences and the restriction model for the indel matrix.

Four simultaneous runs of Metropolis-coupled Markov Chain Monte Carlo (MCMCMC) analyses each with four parallel chains were performed for 1 million generations, saving one tree every 100th generation, starting with a random tree. Other MCMC parameters were left with the program’s default settings. Likelihood values appeared stationary after 25,000 generations. From the 10,000 trees saved, the first 250 were discarded. The remaining trees were summarized in a majority rule consensus tree. All trees were drawn with TreeGraph v. 1.10 (Müller and Müller 2004).
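
The sampling figures above follow from simple arithmetic, and the posterior probability of a clade is its frequency among the trees retained after burn-in. The Python sketch below spells this out. The tree counts are taken from the text and are assumed here to apply per run; the clade representation (sets of taxon names) and the four toy trees are made up for illustration and do not reflect MrBayes file formats.

```python
GENERATIONS = 1_000_000
SAMPLE_EVERY = 100
STATIONARY_AFTER = 25_000

trees_saved = GENERATIONS // SAMPLE_EVERY        # trees sampled in total
burn_in = STATIONARY_AFTER // SAMPLE_EVERY       # trees discarded as burn-in
trees_kept = trees_saved - burn_in               # trees entering the consensus

def posterior_probability(clade, sampled_trees):
    """Fraction of sampled trees that contain the clade (a frozenset of taxa)."""
    return sum(clade in tree for tree in sampled_trees) / len(sampled_trees)

# Toy sample: each tree is reduced to the set of clades it contains.
toy_trees = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"})},
    {frozenset({"A", "C"})},
    {frozenset({"A", "B"})},
]
pp_ab = posterior_probability(frozenset({"A", "B"}), toy_trees)
in_majority_rule_consensus = pp_ab > 0.5          # 50% majority-rule criterion

print(trees_saved, burn_in, trees_kept, pp_ab, in_majority_rule_consensus)
```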

Inference of RNA secondary structure

The complete intron structure was calculated from the sequence of Idesia polycarpa (Salicaceae). Idesia has a mid-sized intron where no large indels were observed and the extension of sequences in hotspots was moderate, and thus seemed a suitable model for Malpighiales. Apart from Idesia, structures of subdomain D2 of domain I and entire domains II–VI were calculated for additional taxa with deviating sequences. Secondary structures were determined using RNAstructure 4.3 (Mathews et al. 1996–2006). The respective algorithm is described in Mathews et al. (2004). Currently available algorithms for RNA secondary structure prediction are not able to predict the structure of an entire group II intron (see Mathews et al. 2006 for discussion). Therefore, domains and subdomains of the intron were first identified by comparison with the annotated alignment of petD intron sequences from maize, tobacco, spinach and Marchantia provided by Michel et al. (1989). Since the borders of structural partitions appear to be conserved, they could easily be identified. Then, secondary structures were calculated individually for each domain. Domain I had to be folded subdomain by subdomain due to its large size. The DNA sequences were folded as RNA (allowing U–G pairing). Constraints for the two exon binding sites and the single stranded branch point A were defined. In cases where alternative foldings varying only slightly in their free energy were possible, the choice of structures for illustration was based on both free energy and comparison with the already known group II intron structures (Michel and Dujon 1983; Michel et al. 1989). Structures of each domain were later assembled using the software RNAViz 2.0 (De Rijk et al. 2003) to draw the entire intron.
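
Folding DNA "as RNA" means that T is read as U and that G–U wobble pairs are allowed in addition to the Watson–Crick pairs. The Python sketch below only illustrates this pairing rule for two candidate stem strands. It is not the free-energy minimization performed by RNAstructure; the function names and the short example strands are made up for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair allowed when folding as RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def as_rna(dna):
    """Read a DNA sequence as RNA (T becomes U)."""
    return dna.upper().replace("T", "U")

def can_form_stem(five_prime_strand, three_prime_strand):
    """True if the two strands (both written 5'->3') pair over their full length.

    The 3' strand is reversed so that the helix is antiparallel.
    """
    left = as_rna(five_prime_strand)
    right = as_rna(three_prime_strand)[::-1]
    return len(left) == len(right) and all(
        (a, b) in PAIRS for a, b in zip(left, right)
    )

print(can_form_stem("GGTC", "GATC"))   # contains one G-U wobble pair
print(can_form_stem("GGTC", "GAAC"))   # G opposite A cannot pair
```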

Results

Sequence characteristics of the petB–petD region

The length of the entire fragment consisting of the petB–petD intergenic spacer, the petD 5′-exon and the petD intron ranged from 912 to 1,094 nt in the taxa studied. No substitutions occurred in the petD 5′-exon. The final matrix (only spacer and intron) contained 1,548 characters after the exclusion of hotspots and the petD 5′-exon. Positions excluded as hotspots in individual sequences are given in Appendix 1 (Table 3). The characteristics of the petB–petD region, such as sequence length, GC content, Ti/Tv ratios, and the numbers of variable and informative characters are given in Table 2. A comparison of average GC content of the six intron domains revealed remarkable differences between them (Table 2). Domain I has a GC content slightly higher than domain II but lower than domain III, although domain I is nearly as large as the other five domains together. The highest GC content is observed in domains III, V, and VI, which are all small.

Table 2 Characteristics of the petB-petD spacer and petD intron sequences in Malpighiales

Length variation in the petB–petD spacer was comparatively low. The shortest spacer was found in Phyllanthus fluitans (182 nt) and the longest in Tristichia trifaria (245 nt). Apart from larger indels of 5–10 nt that accounted for most of the length variability in the spacer, single nucleotide indels were frequent. Five hotspots in the spacer were excluded from the phylogenetic analyses. The first (H1) was the part at the beginning of the spacer, where several indels occurred, for which a sequence motif and a probable origin could not be determined. To avoid artifacts in the indel matrix, this part was excluded from analyses. The second (H2) hotspot was a poly-G stretch of 2–7 G’s. The third hotspot (H3) was basically a poly-A stretch of 7–20 nt (containing individual substitutions). The largest hotspot (H4) was 10–54 nt long and an AT-rich satellite-like region. The fifth hotspot (H5) was again a poly-A stretch of 9–15 nt.

The petD intron was shortest in Brunellia mexicana (713 nt) and longest in Malpighia glabra (970 nt). This length variability is mainly due to frequent microstructural changes in two large hotspots in the intron (see below). After exclusion of all hotspots, the number of base characters from the intron ranged from 573 to 673 in the matrix.

Secondary structure of the petD intron

The proposed secondary structure of the petD intron in Idesia polycarpa is shown in Fig. 1. Domain I is connected to the central core by a helical element of 20–24 nt. Domain I comprises the largest part of the intron, varying in length from 369 nt in Brunellia mexicana to 553 nt in Malpighia glabra. Subdomains A, B, and C are small stem-loop structures connected to each other by few interhelical nucleotides. A large helical element (D1), interrupted by several small bulges is the connecting part to subdomains D2 and D3 and forms the stem of the entire subdomain. Subdomain D2 is a large stem-loop element located between subdomain D3 containing the exon binding site 1 (EBS 1) and EBS 2. This stem-loop element corresponds to hotspot H6 and accounts for a large amount of the length variation in the petD intron (Fig. 2). An alignment of the respective sequence parts is only feasible among closely related taxa within some of the families like Salicaceae, Ochnaceae–Quiinaceae, or Rhizophoraceae. Domain II and domain III are small stem-loop structures (Figs. 3, 4) separated by 10–13 interhelical nucleotides depending on the individual taxon. Domain II was approximately 70-nt long in most taxa without major variation between outgroups and Malpighiales. A small poly-T was excluded from the analyses as hotspot H7. Domain III was conserved in its length (Table 2). Short indels of 4–8 nt were present but not frequent and the domain was unambiguously alignable without exclusion of hotspots. Three interhelical nucleotides (ADT) separate domain III from domain IV. Domain IV is the second largest domain and another highly variable element of the intron. The helix that comprises the stem of the domain is often only 4-nt long but substitutions can occur that lead to a larger interhelical part between domain III and IV. Domain IV (Fig. 5) was the most variable domain in terms of length, sequence and structural variability. 
Two hotspots (H8, H9) make up more than half of the domain and are composed of AT-rich elements and poly-A or poly-T stretches. Figure 6 depicts the secondary structure of the inferred inversion in Djinga. Unlike other inversions known (Kelchner and Wendel 1996) it is not associated with a hairpin. Domain IV and V are connected by usually only 1 nt. The structure of domain V (Fig. 7) reflects the conserved scheme known from other group II introns (Lehmann and Schmidt 2003; Michel and Dujon 1983; Pyle et al. 2007). Most parts of it are double-stranded with the exception of the bulge consisting of 2 nt and the small terminal loop of 4 nt. Domain V was the most conserved domain without any length mutations (Fig. 7). Four interhelical nucleotides, either Ts or Cs, separate the stems of domain V and VI. Domain VI was also strongly conserved around 40 nt and is largely helical with a small terminal loop of 3–8 nt (Fig. 8).

Fig. 1

Secondary structure of the petD intron of Idesia polycarpa (Salicaceae). Roman numerals I–VI designate the six intron domains. Domain I is subdivided into subdomains A–D, with the latter being further subdivided into subdomains D1, D2 and D3. The encircled unpaired adenine in domain VI is the branch point A. Sequences falling in hotspots 6–9 are highlighted in bold. The exon binding sites (EBS 1 and EBS 2) and the intron binding sites (IBS 1 and IBS 2) are highlighted in grey

Fig. 2

Structures of the petD group II intron subdomain D2 of domain I across Malpighiales plotted on a simplified phylogeny. Subdomain D2 corresponds to hotspot H6. Note the independent growth of AT-rich stem-loop elements in different lineages that is mainly the result of tandem repeats, e.g., the large size of D2 of Malpighia glabra is due to the 19-nt sequence motif “TTCTTTAATATATTTAATA” that is repeated four times

Fig. 3

Structural variability of domain II of the petD intron in Malpighiales. Chrysobalanus possesses a derived structure due to two insertions (in bold)

Fig. 4

Structural variability of domain III of the petD intron in Malpighiales. Securinega possesses a derived structure relative to Andrachne due to an inserted simple sequence repeat (in bold)

Fig. 5

Structural variability of domain IV of the petD intron in Malpighiales. Bruguiera gymnorhiza possesses a strongly derived sequence due to a multiple simple sequence repeat of the 16-nt motif “TTCATATATGTGTAGA” (highlighted by arrows) that forms a stable stem-loop

Fig. 6

Inversion in the petD intron domain IV of Djinga felicis (Podostemaceae). The inverted motif is highlighted in bold

Fig. 7

Consensus structure of the highly conserved 37 nt long domain V of the petD intron in Malpighiales. The 14 positions that were variable in the dataset are indicated by ambiguity codes and highlighted

Fig. 8

Structural variability of domain VI of the petD intron in Malpighiales. Three microstructural mutations are found, all of which affect the terminal loop. Structures of Dicraeanthus and Medusagyne deviate by obviously independent deletions of 2 or 3 nt, respectively. The phylogenetic context suggests that the two A’s in Medusagyne were already deleted in the common ancestor of Quiinaceae, Medusagynaceae, and Ochnaceae. Lindackeria contains one 13 nt insertion that is a simple sequence repeat (indicated by arrows)

Length mutations

Length mutations were observed in the whole dataset but most of the length variability was found within the mutational hotspots. After excluding hotspots a total of 66 indels in the spacer and 244 in the intron were found (Table 4 in Appendix 2). Small indels were most frequent: 48 of 310 were indels of 1 nt and 130 were between 2 and 10 nt long. Only 23 indels were larger than 50 nt, and nine of these were larger than 100 nt. The largest indel in the dataset spanned 215 nt and was a deletion in domain IV shared by Chrysobalanus icaco and Licania kunthiana (both Chrysobalanaceae), resulting in the absence of nearly half of the domain. Nearly all the other large indels were also located in domain IV, where two inversions of 13 nt were also detected in Dicraeanthus and Djinga (both Podostemaceae).

Phylogeny of Malpighiales

After the exclusion of hotspots the aligned matrix comprised 1,548 characters of which 973 were constant, 130 were variable but parsimony-uninformative, and 445 were parsimony-informative. After appending the 310 coded indels, the number of parsimony-informative characters was 554, whereas 331 were variable but parsimony-uninformative. The parsimony ratchet retained 624 shortest trees of 2,277 steps (CI: 0.44, RI: 0.59, RC: 0.26). Including the coded indels resulted in 483 shortest trees of 2,665 steps (CI: 0.49, RI: 0.60, RC: 0.29).
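
The character counts and tree statistics reported above can be cross-checked with a few lines of Python. The sketch uses only the numbers given in the text and the standard definition of the rescaled consistency index as the product of CI and RI; the variable and function names are arbitrary.

```python
# Substitution matrix (hotspots excluded).
TOTAL, CONSTANT, UNINFORMATIVE, INFORMATIVE = 1548, 973, 130, 445
substitution_sum = CONSTANT + UNINFORMATIVE + INFORMATIVE

# Combined matrix: substitutions plus 310 coded indels.
INDELS = 310
combined_total = TOTAL + INDELS
combined_variable = 554 + 331                 # informative + uninformative
combined_constant = combined_total - combined_variable

# Split of the indel characters implied by the two sets of counts.
informative_indels = 554 - INFORMATIVE
uninformative_indels = 331 - UNINFORMATIVE

def rescaled_consistency(ci, ri):
    """Rescaled consistency index, RC = CI * RI."""
    return ci * ri

rc_substitutions = round(rescaled_consistency(0.44, 0.59), 2)
rc_combined = round(rescaled_consistency(0.49, 0.60), 2)

print(substitution_sum, combined_total, combined_constant)
print(informative_indels, uninformative_indels)
print(rc_substitutions, rc_combined)
```

The number of constant characters stays at 973 in the combined matrix, as expected, because every coded indel is by definition a variable character.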

Results from the tree searches are shown in Figs. 9, 10, 11. Malpighiales were supported as monophyletic in all analyses (99% JK, 1.00 PP). The trees from Parsimony and Bayesian analyses differed only in the positioning of some terminals. Only one backbone node was recovered with confidence. Most of the terminal clades, however, received maximum support by jackknife values and posterior probabilities. The phylogram from Bayesian analysis shows that most of the branches leading to the terminal clades of Malpighiales are short. However, branch lengths differ within terminal clades with the longest branches being observed in Turnera grandidentata, Hypericum hookerianum, Hybanthus anomalus and especially in the Podostemaceae.

Fig. 9

Strict consensus tree of 483 shortest trees found by the parsimony ratchet based on the petD dataset (excluding hotspots) combined with indels. Tree length: 2,665 steps (CI: 0.49, RI: 0.60, RC: 0.29). Numbers above branches are Jackknife support values (10,000 JK replicates)

Fig. 10

The 50% majority-rule consensus tree obtained from Bayesian Inference based on the petD dataset (excluding hotspots) combined with indels. Numbers above branches are Posterior Probabilities. Note the clade comprising Achariaceae, Violaceae, Malesherbiaceae, Turneraceae, Passifloraceae, and a Lacistemataceae–Salicaceae lineage (Violids) that is depicted with high posterior probability, congruent with the parsimony tree

Fig. 11

Phylogram obtained from Bayesian Inference depicting long branches in the Hypericaceae-Podostemaceae-lineage

A clade of Podostemaceae, Clusiaceae, and Hypericaceae is supported with 100% JK support and a PP of 1.00. Hypericum is sister to the Podostemaceae and the Clusiaceae. Calophyllum appears distant from other Clusiaceae genera Clusia and Garcinia. Euphorbiaceae are found as sister to the Hypericaceae/Podostemaceae/Clusiaceae clade in the parsimony tree, but there is no support for this grouping.

Linaceae are supported as monophyletic with maximal support, although the relationships within the clade are not resolved in the parsimony trees. Irvingia is depicted as sister to Linaceae but support for this grouping is low (0.62 PP). The sister family of Malpighiaceae are Elatinaceae with 83% JK support and a posterior probability of 1.00. Rhizophoraceae are found as sister to Erythroxylaceae with maximum support and both may be sister to the Ochnaceae s.l. clade but this grouping receives only 0.59 PP in the Bayesian tree. A clade comprising Chrysobalanaceae, Dichapetalaceae, Trigoniaceae, and Balanopaceae is supported with 83% JK and a PP of 1.00. Caryocaraceae are additionally found as sister to this clade in the Bayesian trees (0.76 PP). The two former Euphorbiaceae lineages Phyllanthaceae and Picrodendraceae were found to be sister to each other with 96% JK support and 1.00 PP. Pandaceae and Humiriaceae are supported as monophyletic, but their position within Malpighiales or their sister group is not resolved. Ochnaceae, Quiinaceae and Medusagynaceae form a clade that receives maximum support. The only backbone node that is supported as monophyletic with 81% JK and PP = 1.00 comprises Achariaceae, Violaceae, Passifloraceae, Turneraceae, Malesherbiaceae, Lacistemataceae and Salicaceae (including former Flacourtiaceae genera). Turneraceae, Malesherbiaceae, and Lacistemataceae appear in a clade. Moreover, Lacistemataceae are supported as sister to Salicaceae. The Bayesian tree further resolves Achariaceae as sister to Violaceae (0.84 PP) and the Achariaceae–Violaceae clade as sister to a Passifloraceae/Malesherbiaceae/Turneraceae plus Lacistemataceae plus Salicaceae clade.

Discussion

Molecular evolution of the petD intron

The secondary structure calculated for the petD intron of Idesia (Salicaceae) in this study fits very well into the known scheme of group II introns (Hausner et al. 2006; Michel et al. 1989; Qin and Pyle 1998; Toor et al. 2001). Alternative foldings are either energetically less favoured or violate structural constraints essential for correct splicing. Since subdomain D2 and domain IV are highly variable in terms of substitutions and sequence length, a common scheme for all petD introns cannot be inferred. The structures calculated here reflect an optimization based on energy minimization that might only change slightly with advancing energy tables and algorithms. The first detailed study on petD intron evolution was conducted by Löhne and Borsch (2005). The authors’ analysis of the frequency of structural partitions (stems, loops, bulges, interhelical single stranded sequence) in the different domains was an approximation based on the annotated consensus alignment by Michel et al. (1989) and visual examinations of the sequences with attention to complementary regions. In contrast, this study shows the exact distribution of structural elements for the calculated intron structure of Idesia. In this study, all effectively paired nucleotides (Fig. 13) are considered helical. The need for understanding the effects of differential evolution of sequence partitions in phylogeny inference has clearly been pointed out by Kelchner (2002). Future work needs to recognize consensus helical elements by comparing secondary structures in order to group sequence characters that evolve under comparable constraints into a common class.

Mutational hotspots are located in subdomain D2 of domain I, domain II and domain IV, which are the most variable parts of the intron. Already existing datasets for the petD intron, i.e., those of Löhne and Borsch (2005) and the basal eudicots dataset of Worberg et al. (2007), allowed a comparison of hotspot locations. The hotspot in D2 is present in all datasets but is remarkably smaller in basal angiosperms or basal eudicots. Mutational dynamics as well as the AT content are increased in D2 in Malpighiales. A hotspot in subdomain C of domain I was found in both earlier studies, but not in the dataset analysed here. A hotspot in domain II is present in the alignment of Worberg et al. (2007) and in this dataset in about the same position. Alignments of different taxon sets basically show highly variable regions (hotspots H8/H9 in Malpighiales) in terminal parts of domain IV but these cannot be assigned to homologous sequence elements in different groups of angiosperms. Possible causes lie in deviating mutational mechanisms that lead to insertion of AT-rich elements (see below).

Patterns of sequence conservation correspond to domain patterns of group II introns. Domain I is important for correct splicing and contains several tertiary interaction sites (Pyle and Lambowitz 2006). Besides domain I, domain V is the only structural element that is essential for the catalytic function of the intron (Lehmann and Schmidt 2003; Pyle and Lambowitz 2006). It is the most conserved element with no length variability in this study. In domain I large parts apart from subdomain D2 are conserved. The percentage of variable characters (46%) is comparable to domain III (41%), but considering the length of both domains, domain I is by far the more conserved one. Generally, domain IV is considered to be the most variable of all group II intron domains with respect to size and primary sequence (Lehmann and Schmidt 2003; Pyle and Lambowitz 2006). This can be confirmed for petD in Malpighiales (Table 2). Sequence variation in the most conserved domains V and VI affects only their terminal parts. In domain V only one site located in the 4-nt long terminal loop seems freely substituted, exhibiting all four possible nucleotide states in Malpighiales (Fig. 7). In domain VI the branch point A that is essential for the transesterification during the splicing reaction is invariable, along with many other positions. The only microstructural changes observed affect the terminal loop (Fig. 8).

The striking length variability of subdomain D2 is the result of microstructural mutations happening independently in different lineages of Malpighiales (Fig. 2). Observation of sequence motifs revealed that length variability is caused mostly by multiple tandem repeats and poly-T stretches. As suggested by Levinson and Gutman (1987), sequence motifs once repeated are prone to further duplication. Additional duplications might then involve the template motif and earlier duplicated elements at once, so that multiple repeats can be explained by few steps. Such a pattern is most prominent in the sequence of Malpighia (Fig. 2). To explain the evolution of terminal stem-loop elements in the P8 loop that is part of the trnL group I intron, Quandt et al. (2004) suggested that slippage-mediated growth of A/T-rich sequence elements led to independent elongations of P8 in different land plant lineages. This process appears to have led to the stepwise insertion of up to 250 nt. It was further hypothesized that hairpin formation of complementary AT-rich sequence elements results in the stabilization of structure. We believe that similar mechanisms of sequence evolution also occur in subdomain D2 of domain I (Fig. 2) and possibly in domain IV. Figure 5 shows domain IV of Bruguiera gymnorhiza with a multiple tandem repeat of 19 nt. The repeat motif either pairs with itself or is complementary to other sequence parts of the domain.
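
Tandem repeats and poly-T stretches of the kind described above can be located with a simple scan. The following is a minimal sketch, not the motif-recognition procedure used in this study; the function name tandem_repeats and the example sequences are hypothetical.

```python
def tandem_repeats(seq, unit_len, min_copies=2):
    """Return (start, unit, copies) for each run in which a motif of
    length unit_len is repeated at least min_copies times in direct
    succession. The sequence is scanned from left to right."""
    seq = seq.upper()
    hits = []
    i = 0
    while i + unit_len <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        # Extend the run while the next window equals the motif.
        while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
            copies += 1
        if copies >= min_copies:
            hits.append((i, unit, copies))
            i += copies * unit_len  # continue after the repeat run
        else:
            i += 1
    return hits


# Hypothetical sequence: a 7-nt motif repeated three times.
print(tandem_repeats("AAGCCTACTGCCTACTGCCTACTTT", 7))
# Poly-T stretches are the special case of a 1-nt motif.
print(tandem_repeats("ACTTTTTTGA", 1, min_copies=5))
```

The scan only finds perfect repeats; repeats that have since accumulated substitutions would need an approximate-matching approach.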

In petD of Malpighiales a negative correlation of G/C content and sequence length is evident in domain I and in domain IV, affecting the whole intron (Fig. 12).

Fig. 12

Negative correlation of intron length and GC content for (a) the whole intron, (b) domain I, and (c) domain IV. The overall trend that increased size of the intron does not lead to a higher GC content is most prominent in the longest petD intron sequence in the dataset (970 nt) that has one of the smallest GC contents (29.6%)
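
The relationship between sequence length and GC content can be checked with a short calculation. This is a minimal sketch assuming ungapped sequences held as strings; the helper names gc_content and pearson_r are hypothetical, and the three toy sequences are invented for illustration and are not data from this study.

```python
from math import sqrt


def gc_content(seq):
    """GC content in percent, counting only unambiguous A, C, G, T."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy sequences (hypothetical): longer ones carry more AT-rich repeats.
introns = ["GCGCATGC", "GCATATATGC", "GCATATATATATGC"]
lengths = [len(s) for s in introns]
gc = [gc_content(s) for s in introns]
r = pearson_r(lengths, gc)
print(lengths, gc, r)
```

A negative value of r corresponds to the trend shown in Fig. 12; whether it is significant would need a proper test on the real data.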

Microstructural changes are now widely accepted to provide useful phylogenetic information with a low degree of homoplasy (e.g., Graham et al. 2000; Müller and Borsch 2005; Simmons and Ochoterena 2000). Nevertheless, the mutational mechanisms leading to microstructural changes are far from clear. We have analyzed the effects of a number of larger microstructural mutations (inserted or deleted motifs > 3 nt) on secondary structure. There seem to be two groups of such mutations. One group (Fig. 5) comprises those in AT-rich terminal stem-loops as discussed above. The other group (Figs. 3, 4) comprises length mutations that do not occur in terminal loops, where their impact on the overall structure would be lowest. In the latter group the inserted repeats lead to the formation of helical secondary structural elements that are GC-rich and therefore stable. In addition, reverse complementary sequence elements to the inserted motif are present in other parts of a domain. Figure 4 illustrates an SSR in domain III that is synapomorphic for Phyllanthaceae (Phyllanthus and Securinega). Compared to the sister taxon Andrachne (Fig. 4; plesiomorphic state without SSR), the inserted motif “GCCTACT” has a complementary 5′ part and leads to an elongated stable stem in Securinega. A similar situation is found in domain II (Fig. 3). The still insufficient resolution of the tree of Malpighiales limits the analysis of the evolutionary history of microstructural changes to unambiguous cases such as the ones discussed. The mechanisms that lead to the insertion of long G/C-rich, repeated sequence elements may differ from those acting in A/T-rich stem-loops, the latter of which are usually explained by slipped strand mispairing (Quandt et al. 2004).
Slipped strand mispairing (Levinson and Gutman 1987) seems to be an insufficient explanation for the insertion of rather long (sometimes 20 nt and more) G/C-rich elements because patterns of homoplasy differ between GC-rich domain elements and AT-rich stem-loops. Borsch et al. (2007) found a strong insertion bias of SSRs in the evolution of the trnT-trnF region in Nymphaeales. However, slipped strand mispairing, as it is also considered to occur in satellite sequences (Levinson and Gutman 1987), is expected to result in a stochastic distribution of deletions and insertions of short motifs. Our observation of long insertions that lead to stable helical elements in the intron’s secondary structure appears to be in line with such a bias, because stable RNA foldings might be less likely to be affected by negative selection. Further structural comparisons of length-variable sequences in a phylogenetic context are likely to provide insights into patterns and mechanisms of intron evolution.
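
The search for reverse complementary partners of an inserted motif can be sketched as follows. This is a minimal illustration at the DNA level that considers only Watson-Crick pairs (G-U wobble pairs, which also occur in RNA helices, are ignored); the function names and the example domain sequence are hypothetical, and only the motif GCCTACT is taken from the text above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_pairing_partners(domain_seq, motif):
    """Start positions (0-based) in domain_seq of exact matches to the
    reverse complement of motif, i.e. candidate pairing partners."""
    target = reverse_complement(motif)
    domain_seq = domain_seq.upper()
    positions = []
    start = domain_seq.find(target)
    while start != -1:
        positions.append(start)
        start = domain_seq.find(target, start + 1)
    return positions


# Hypothetical domain fragment containing the motif and its partner.
domain = "TTAGTAGGCAAAGCCTACTTT"
print(reverse_complement("GCCTACT"))
print(find_pairing_partners(domain, "GCCTACT"))
```

An exact-match search is only a first screen; whether two elements actually pair has to be judged from the secondary structure of the whole domain.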

Phylogenetic utility of the petB-petD region at ordinal level and the backbone of Malpighiales

The best phylogenetic hypotheses for Malpighiales available so far are trees inferred from the multi-gene datasets of Davis et al. (2005), Soltis et al. (2000) and Tokuoka and Tobe (2006). The petD trees recovered all major lineages inferred by the multigene studies and even resolved additional nodes. The application of petD sequence data in this study provides yet another example that non-coding and rapidly evolving genomic regions contain the same or even more phylogenetic structure than much larger datasets of coding gene sequences.

The fact that a backbone node (a clade comprising the seven families Passifloraceae, Malesherbiaceae, Turneraceae, Violaceae, Salicaceae, Lacistemataceae, and Achariaceae) receives significant jackknife support with plastid DNA data for the first time can be taken as further evidence for the phylogenetic utility of petD in Malpighiales. Well supported trees have been inferred based on petD sequence data across angiosperms. Löhne and Borsch (2005) found trees for early diverging angiosperms comparable to gene trees of matK and trnT-trnF. Worberg et al. (2007) obtained a similar picture for the basal grade of eudicots. One of the most comprehensive datasets so far for different chloroplast spacers, introns and matK with identical taxon sampling is the Nymphaeales dataset of Löhne et al. (2007). A comparison of variability, homoplasy and phylogenetic structure of different group II introns in Nymphaeales revealed the highest values of phylogenetic structure R (Müller et al. 2006) for the rpl16 and the trnK intron, whereas the petD intron had the lowest R value. The petD intron seems to be one of the most conserved group II introns in the chloroplast single copy region. Thus, it will be promising to employ other group II introns, such as those residing in rpl16 or trnK, for phylogeny reconstruction in Malpighiales.

The alignment of petD sequences in Malpighiales was straightforward, as experienced in other datasets of angiosperms. Mutational hotspots are well defined (see also discussion above), although they are not much smaller than those delimited in alignments across basal angiosperms (Löhne and Borsch 2005) or basal eudicots (Worberg et al. 2007). When only a single clade of angiosperms such as the Malpighiales is sampled, it could be expected that overall distances between sequences are smaller and that, accordingly, the hotspots are smaller. However, our data show that this is not necessarily true because of lineage-specific effects. Mutational dynamics seem to be increased within hotspot regions in several Malpighiales families, including the lineage-specific insertions of A/T-rich sequence elements described above. In groups of closely related taxa where the respective regions in domains I and IV have a common evolutionary history, additional petD characters can be used at lower taxonomic levels.

Relationships within Malpighiales

This study is the first to use non-coding spacer and intron sequences for phylogeny inference of the Malpighiales. Most of the interfamilial relationships found in previous studies were also recovered in our analysis, and several clades received even higher support. An important outcome is that our analysis corroborated the close relationship of Salicaceae, Lacistemataceae, Turneraceae, Passifloraceae, Malesherbiaceae, Violaceae, and Achariaceae which received 83% JK and a PP of 1.0. This group is here called Violids (Figs. 10, 11) to facilitate further discussion. The clade has been previously hypothesized by a combined analysis of ndhF and rbcL data (Davis and Chase 2004) and in the four-gene study of Tokuoka and Tobe (2006) but only with 57% BS and 59% BS, respectively.

Passiflora, Turnera and Malesherbia form a clade that corresponds to Passifloraceae sensu lato of APG II (2003), where an inclusion of Turneraceae and Malesherbiaceae into Passifloraceae was suggested. Passifloraceae and Turneraceae are tropical herbs, shrubs, vines, or rarely trees; Malesherbiaceae are a small family of xerophytes native to the Andes and to the arid parts of coastal Chile and Peru. These families formed a clade with 100% support in Chase et al. (2002) and Davis and Chase (2004), as well as in the three-gene study of Soltis et al. (2000). Chase et al. (2002) found Turneraceae and Malesherbiaceae to be sister to Passifloraceae, whereas our petD data provide evidence that Turneraceae and Passifloraceae are sister groups (98% JK, 1.0 PP). The relationship of these three families with respect to floral morphology was discussed recently by Krosnick et al. (2006).

Our analysis recovered Lacistemataceae as sister to Salicaceae with 78% JK and a PP of 1.0. This confirms the findings from two- to four-gene studies (Davis et al. 2005; Tokuoka and Tobe 2006) and an analysis using matR sequences (Davis and Wurdack 2004). Salicaceae is here used in its recent and broad definition (APG II 2003) including Flacourtiaceae p.p. The woody pantropical family Flacourtiaceae has been shown to be polyphyletic in all previous molecular analyses. The morphology of Flacourtiaceae is very heterogeneous and the circumscription of the family has always been controversial. Based on a detailed molecular analysis using rbcL, Chase et al. (2002) proposed a splitting of the family: one part was transferred to Salicaceae; the other part was placed in the newly accepted Achariaceae (APG II 2003). Not surprisingly, representatives of the former Flacourtiaceae were retrieved in our analysis in Salicaceae s.l. and Achariaceae, respectively. Since the two families are not sister to each other, the separation of Achariaceae as proposed by Chase et al. (2002) is supported by our petD data.

It is noteworthy that the families of the Violid clade were all assigned to the order Violales sensu Cronquist (1981) except Salicaceae s.str. A feature that could be considered a synapomorphy for this clade is parietal placentation. In Cronquist’s system, Flacourtiaceae were supposed to stand “basal” within Violales with supposed affinities to Lacistemataceae. Turneraceae, Passifloraceae, and Malesherbiaceae were considered to be related to each other, but as distinct families that probably originated in or near Flacourtiaceae. Achariaceae (circumscribed including only the genera Acharia, Ceratiosicyos and Guthriea) were also considered to be related to Passifloraceae (Cronquist 1981). Salicaceae, consisting only of the genera Salix and Populus, were treated as the separate monofamilial order Salicales. However, Cronquist also mentioned that Salicales share many morphological features (such as the numerous stamens, parietal placentation, separate styles and the occurrence of salicin in Salix, Populus and Idesia) with Flacourtiaceae and could possibly be placed near them. Thus, there is also support from non-molecular characters for the clade of members of the former Violales (plus Salicaceae and Lacistemataceae) depicted in the petD trees.

Clusiaceae and Hypericaceae were always considered to be related to each other but were treated differently regarding their taxonomic rank. Some authors, e.g., Takhtajan (1997), and the most recent classification system of APG II (2003) maintained Clusiaceae and Hypericaceae as separate families. Other authors considered them as subfamilies within Clusiaceae (e.g., Cronquist 1981). When a broad circumscription of the family was applied, Clusiaceae was paraphyletic in a study using rbcL sequences (Gustafsson et al. 2002). The phylogeny presented therein recovered the subfamilies Clusioideae and Kielmeyeroideae as well supported clades, but subfamily Hypericoideae formed a clade with Podostemaceae. A sister group relationship between Hypericaceae/Hypericoideae and Podostemaceae was also recovered by our petD data (100% JK, PP 1.0) as well as in the four-gene studies of Davis et al. (2005) and Tokuoka and Tobe (2006). Since Calophyllum does not appear in the same clade as Clusia and Garcinia, petD data suggest that Clusiaceae might also be paraphyletic to the Hypericaceae–Podostemaceae clade (Figs. 11, 12, 13), but this requires further testing with additional sequence data and increased taxon sampling. Davis et al. (2005) found that not Clusiaceae but Bonnetiaceae (a family not included in our study) are sister to Hypericaceae/Podostemaceae (with 80% BS). Due to the odd morphology of Podostemaceae it has long been problematic to place them within angiosperms (Soltis et al. 1999), and they seem to have little in common with Hypericaceae. However, a closer look reveals that Hypericaceae and Podostemaceae also share a number of non-molecular characters (Gustafsson et al. 2002). For Podostemaceae our petD data corroborate the close relationship of Dicraeanthus and Djinga (Podostemoideae), whereas Tristichia (subfam. Tristichoideae) is distantly related (Kita and Kato 2001; Moline et al. 2007).

Fig. 13

Proportion of structural elements of the petD intron of Idesia. Only those nucleotides that connect the intron domains are referred to as interhelical, single stranded parts or single unpaired nucleotides within domains are referred to as bulges

The monophyly of Malpighiaceae is well supported by rbcL and matK (Cameron et al. 2001) as well as ndhF and trnL-F data (Davis et al. 2001). The floral morphology of Malpighiaceae is unique and distinguishes them from other rosids. Assumptions about the sister group of Malpighiaceae were difficult because of their morphological uniqueness (Cronquist 1981). A first hypothesis based on molecular data came from Davis and Chase (2004), who sampled a broad range of taxa from Malpighiales to establish the sister family of Malpighiaceae, which turned out to be the small cosmopolitan family Elatinaceae. Elatinaceae and especially the genus Elatine are mostly aquatic herbs or semi-aquatic shrubs and were formerly placed near Clusiaceae and Hypericaceae (Cronquist 1981; Takhtajan 1997) because of morphological similarities, such as opposite leaves, seed and stem anatomy. However, since the morphological features of Elatinaceae were difficult to interpret, they were also treated as a separate order Elatinales by Takhtajan (1997). Our study again provides evidence (88% JK, PP 1.00) that Elatinaceae are sister to Malpighiaceae. There are indeed some morphological and cytological features that link Malpighiaceae and Elatinaceae, as discussed in detail by Davis and Chase (2004). Most notable are the chromosome base number of X = 6 (shared only with byrsonimoids), opposite or whorled leaves with stipules, and the presence of unicellular hairs and multicellular leaf glands.

Erythroxylaceae and Rhizophoraceae are families of tropical shrubs or trees with simple leaves and cymose inflorescences. Common features are tropane alkaloids and the presence of sieve-element plastids containing protein crystals (Nandi et al. 1998; Setoguchi et al. 1999). Both families may be treated together as Rhizophoraceae s.l. (APG II 2003). This study recovers both families as sisters, in line with the results of Savolainen et al. (2000b), Schwarzbach and Ricklefs (2000), Setoguchi et al. (1999) and the three-gene study of Soltis et al. (2000), each with >90% bootstrap support.

There is evidence for a close relationship between the monogeneric family Medusagynaceae, an endemic family of the Seychelles, and the tropical families Quiinaceae and Ochnaceae. APG II (2003) suggested the inclusion of Quiinaceae and Medusagynaceae in a more widely circumscribed Ochnaceae sensu lato. Ochnaceae s.l. are recovered as a strongly supported (100% JK, PP 1.00) monophyletic group by the petD data, as already suggested by all studies that sampled taxa from these families (Chase et al. 2002; Fay et al. 1997; Savolainen et al. 2000b; Soltis et al. 2000). Quiinaceae are probably sister to Medusagynaceae and Ochnaceae, although only Soltis et al. (2000) provided some statistical support (60% JK) for this hypothesis. The most recent study of these families with a broad taxon sampling (Schneider et al. 2006) recovers Ochnaceae, Quiinaceae and Medusagynaceae as monophyletic groups, and the authors suggest maintaining them as separate families. The three families were considered to be closely related by Cronquist (1981), who assigned them to the order Theales but without making assumptions about a direct relationship between them. Some morphological features that are common to all three families can be found, such as multilacunar nodes, mucilage cells/cavities, dentate leaves, and bitegmic ovules (Fay et al. 1997).

Euphorbiaceae are a large and highly diverse family of mainly tropical herbs, trees and shrubs. The genus Euphorbia is also very diverse in the Mediterranean Basin, South Africa and East Africa, where it is often succulent and cactus-like. First molecular evidence for the polyphyly of Euphorbiaceae was found by Chase et al. (1993), where Euphorbia appeared as sister to Passiflora and Drypetes as sister to Ochna. Subsequent studies confirmed the assumption that Euphorbiaceae were polyphyletic in their previous circumscription, since they appeared scattered among Malpighiales (Chase et al. 2002; Savolainen et al. 2000b; Soltis et al. 2000). Consequently, two former sublineages of Euphorbiaceae have been segregated as the new families Pandaceae (the former tribe Galearieae) and Putranjivaceae (the former tribe Drypeteae) in the system of APG I (1998). Pandaceae were treated as a separate family related to Euphorbiaceae already in the system of Cronquist (1981). Savolainen et al. (2000b) proposed the additional separation of the subfamilies Phyllanthoideae and Oldfieldioideae that were classified as Phyllanthaceae and Picrodendraceae in APG II (2003). Kathriarachchi et al. (2005) further clarified relationships within Phyllanthaceae and the circumscription of the family. The remaining Euphorbiaceae sensu stricto have been verified to be monophyletic (Wurdack et al. 2005). Most recently, Davis et al. (2007) depicted the parasitic Rafflesiaceae as one of the three major clades within Euphorbiaceae s.str.

A close relationship of Phyllanthaceae and Picrodendraceae was already suggested by Davis and Chase (2004) but only with 53% BS support. PetD data resolve the Phyllanthaceae-Picrodendraceae clade with high confidence (96% JK; PP 1.00). Further support comes from morphology, with shared features such as unisexual, apetalous trimerous flowers, crassinucellar ovules with a nucellar beak, a large obturator, and explosive fruits with carunculate seeds, which also unite both families with Euphorbiaceae (Merino Sutter et al. 2006).

Our study retrieved a well-supported clade of the small tropical families Balanopaceae, Chrysobalanaceae, Dichapetalaceae, and Trigoniaceae (89% JK, PP 1.00) with Balanopaceae being sister to the rest (89% JK, PP 1.0). This finding is congruent with the results of Soltis et al. (2000) and Savolainen et al. (2000b). Balanopaceae appeared as sister to the other four families in both studies, and APG II (2003) suggests an inclusion of Trigoniaceae, Dichapetalaceae, and Euphroniaceae in an expanded Chrysobalanaceae.

Conclusion

Single non-coding and rapidly evolving plastid genomic regions contain phylogenetic structure that is comparable to the information content of much larger datasets of coding gene sequences with a many times higher number of nucleotides sequenced per taxon. As such, chloroplast introns and spacers are promising markers to resolve the tree of Malpighiales and other recalcitrant clades. Selecting highly informative genomic regions to be combined in phylogenetic analyses may be more effective than total evidence approaches that combine any kind of sequence data available.

Because of frequent microstructural mutations occurring during the evolution of intron sequences, analytical approaches need to be more complex than for sets of length-conserved sequences. Secondary structure analyses are helpful to understand patterns and mechanisms underlying microstructural mutations. Intron sequences evolve differently in different domains, and levels of sequence conservation vary considerably with respect to different structural partitions. Considering these patterns of intron evolution is essential for homology assessment. Most importantly, hypervariable AT-rich terminal stem-loop elements within domains I and IV may evolve independently in different lineages, and thus have to be excluded from phylogeny inference in matrices comprising distant taxa. Nevertheless, when an alignment principle that is based on recognizing sequence motifs is applied, the recognition of such mutational hotspots is straightforward.