Introduction

The last decades have seen an exponential increase in molecular phylogenetic studies of angiosperms and an emerging consensus at higher levels. The order Fabales Bromhead was one of the most surprising angiosperm clades to result from early studies of interfamilial relationships. Because the four families of Fabales are very diverse morphologically (APG III 2009; Bello et al. 2009), most classification systems placed only the Leguminosae (Fabaceae) Juss. in the order Fabales until DNA sequence data became available, while the other families now placed in the order, Polygalaceae Hoffmanns. & Link, Surianaceae Arn., and Quillajaceae D. Don, appeared in different taxonomic groups (Bello et al. 2009).

Molecular studies and fossil evidence suggest an ancient origin and rapid radiation for Fabales (e.g., Crane et al. 1990; Zi-Chen et al. 2004; Lavin et al. 2005; Pigg et al. 2008; Bello et al. 2009) (note, however, that the fossils of Polygalaceae and Surianaceae are unconfirmed, and the fossil record of Fabales may be incomplete). The monophyly of the order is strongly supported by several studies (e.g., Bello et al. 2009, 2012; APG IV 2016), but the overall phylogenetic relationships across the order and the position of the root remain controversial, a situation common in higher-level phylogenetic studies of ancient, rapid radiations (Bello et al. 2009). Previous studies that recovered different interfamilial topologies for Fabales used different DNA regions and had very different and unbalanced taxon sampling (e.g., Crayn et al. 1995; Doyle et al. 2000; Savolainen et al. 2000; Soltis et al. 2000; Kajita et al. 2001; Persson 2001; Wojciechowski et al. 2004; Lavin et al. 2005; Forest et al. 2007; Bruneau et al. 2008; Soltis et al. 2011). Phylogenetic instability has been attributed not only to the putative rapid radiation in the early history of the order, but also to sampling directed above (i.e., angiosperms) or below (i.e., Leguminosae, Polygalaceae) the ordinal level (Bello et al. 2009). Nevertheless, even studies focused on Fabales could not yield robust relationships for the order (Table 1).

Table 1 Summary of previous studies focused on Fabales. In Forest’s 2004 study, “complete” refers to “all taxa regardless of the missing sequences” and “partial” refers to “only taxa for which all of the DNA regions were sequenced”

The most comprehensive studies addressing the phylogeny of order Fabales were by Bello et al. (2009, 2012). In their first study, five different topologies were recovered using maximum parsimony (MP) and Bayesian analysis (BI) based on the rbcL and matK plastid regions (Table 1). The Shimodaira-Hasegawa test (Shimodaira and Hasegawa 1999) they conducted favored a resolved topology over a polytomy, but none of the five possible topologies outlining the relationships between the four families of Fabales received a significantly better likelihood. In all analyses, Fabales and each of its component families were monophyletic, and support values were mostly very high for all these clades. However, all five topologies for interfamilial relationships within the order received low-to-moderate support, an observation common to many rosid orders and attributed to rapid, early radiation within Fabales (Bello et al. 2009; Wang et al. 2009). Furthermore, Bello et al. (2009) reported that the stem age estimates for Leguminosae, Polygalaceae and the pair Surianaceae + Quillajaceae are very similar, which would support the idea of a rapid radiation in the early history of the order.

In their second study (Bello et al. 2012), two hypotheses emerged from the combination of 66 morphological characters with previously published rbcL and matK plastid regions. The morphological characters described floral development and anatomy, and MP and BI analyses were used to explore three data sets which differed in the proportion of missing data and in the choice of outgroup taxa (Table 1). The two recovered topologies were (((S + Q)L)P) and (L + P)(S + Q), with the latter only recovered from BI analyses of the most densely sampled matrices (Table 1). The most frequently recovered topology, (((S + Q)L)P) was considered the most likely in the light of morphology, in spite of low-to-moderate support from both MP and BI analyses.
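The difference between the two recovered topologies can be made explicit by listing the clades each one implies. The following is a minimal Python sketch, assuming the single-letter family codes used above (L, P, S, Q) and treating both trees as rooted; the helper `clades` is hypothetical and is not part of any software cited in this study.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples.

    Each clade is a frozenset of tip labels; single tips and the root
    clade (all tips) are left out because every tree shares them.
    """
    found = set()

    def tips(node):
        if isinstance(node, str):          # a tip label
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)                 # record this internal node
        return members

    all_tips = tips(tree)
    found.discard(all_tips)
    return found


# The two topologies named in the text, written as nested tuples.
topology_a = ((("S", "Q"), "L"), "P")      # (((S + Q)L)P)
topology_b = (("L", "P"), ("S", "Q"))      # (L + P)(S + Q)

shared = clades(topology_a) & clades(topology_b)
only_a = clades(topology_a) - clades(topology_b)
only_b = clades(topology_b) - clades(topology_a)
```

Read this way, both hypotheses contain the Surianaceae + Quillajaceae pair and differ only in whether Leguminosae groups with that pair or with Polygalaceae.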

Despite the attention that phylogenetic relationships within Fabales have received, a well-supported interfamilial topology remains elusive. This unresolved phylogeny also leaves evolutionary questions unanswered, hindering, for example, the estimation of diversification rates (e.g., Smith et al. 2011; Koenen et al. 2013) and the understanding of trait evolution and biogeography. Therefore, an unambiguous phylogenetic answer for the four Fabales families is required. Moreover, the genomic markers used to date in phylogenetic reconstructions within the order have mostly been from the plastid genome. However, the prevailing view is that both nuclear and plastid DNA sequence data are needed to fully understand flowering plant evolutionary history, because nuclear regions can provide insights into hybridization, polyploidy and reticulation (Sang 2002; Álvarez and Wendel 2003). Therefore, in the present study, 26S rDNA sequence data are explored alongside previously published sqd1 data from the nuclear genome, and matK and rbcL data from the plastid genome.

sqd1 (UDP sulfoquinovose synthase gene) is a low-copy nuclear gene and one of the five conserved orthologue set (COS) markers highlighted in a survey of universally amplifiable markers; it is 267 base pairs (bp) long in angiosperm families, easy to align due to the lack of indels, and highly parsimony informative (Li et al. 2008). Babineau et al. (2013) screened the phylogenetic utility of 19 low-copy nuclear genes for caesalpinioid legumes and highlighted that the sqd1 region has potential for familial to tribal-level resolution, with almost 30% parsimony-informative characters.

The 26S nuclear ribosomal DNA (rDNA) has been used in several phylogenetic studies (e.g., Fan 2001; Soltis et al. 2001; Zanis et al. 2003; Weitemier et al. 2015; Xu et al. 2015). It potentially has many advantages for phylogenetic reconstruction: (1) it consists of both variable and conserved regions suitable for closely and distantly related taxa; (2) it has very high copy numbers, making amplification generally easy with mostly universal primers (Baldwin et al. 1995; Bailey et al. 2003; Weitemier et al. 2015; Xu et al. 2015); and (3) like all nuclear loci, it is biparentally inherited, providing insights into hybrid parentage, polyploidy events and reticulation (Álvarez and Wendel 2003). However, some drawbacks related to its high copy number have also been reported, such as intra-individual and intra-genomic variation with multiple copy types found within individuals, often incomplete and bidirectional homogenization of copy types, incomplete concerted evolution, paralogy problems, secondary structures, high GC content and the presence of potentially non-functional pseudogene sequences (Hillis and Dixon 1991; Baldwin 1992; Baldwin et al. 1995; Soltis and Soltis 1998; Álvarez and Wendel 2003; Bailey et al. 2003). Among these, the view on the inclusion or exclusion of pseudogenes changes from one study to another (Bailey et al. 2003). While some authors exclude potential pseudogenes due to concerns about alignment or long-branch attraction (LBA; Felsenstein 1978), others include them to address issues related to the potential reticulate evolution of taxa. Many approaches, such as pairwise comparisons and tree-based methods, have been applied to detect these pseudogenes (e.g., Hughes et al. 2002).

Despite the apparent early enthusiasm for the 26S gene and its potential in phylogenetics, the popularity of the 26S rDNA region fell due to the increased interest in low-copy nuclear genes and the low phylogenetic signal subsequently reported for the region (Soltis et al. 2011). The extent to which the above-mentioned issues affect phylogenetic reconstruction varies among groups of organisms. For example, phylogenetic studies have rated the inclusion of the conserved 26S rDNA sequences from useful (e.g., Fan 2001; Neyland 2002; Soltis et al. 2011) to inconsistent (e.g., Ro et al. 1997; Muellner et al. 2003).

The matK plastid region is one of the most frequently employed genes in phylogenetic analyses (e.g., Hilu et al. 2003; Luckow et al. 2003; Wojciechowski et al. 2004; Lavin et al. 2005; Kim and Kim 2011; Wanntorp et al. 2011; Kim et al. 2013; LPWG 2017). It was shown, not only for Leguminosae but also for Fabales, that this plastid gene successfully resolves many relationships with high support due to its high substitution rate (Lavin et al. 2005; Bello et al. 2009; LPWG 2017). Similarly, the rbcL region is another commonly sequenced plastid gene for Fabales. While the use of this gene for Fabales was not recommended (Bello et al. 2009), nor was it as useful as matK for Leguminosae (Lavin et al. 2005), the possibility of it contributing to a robust combined analysis should not be ruled out.

In the present study, a broader outgroup sampling compared to previous studies of Fabales was employed to reduce tree imbalance artefacts (Smith 1994), and particularly to reduce problems associated with LBA (Felsenstein 1978) by breaking long branches between the ingroup and outgroup. The 34 outgroup taxa used here were chosen to represent each family from seven Fabidae orders. Additionally, as well as combining new nuclear sequence data with previously published nuclear and plastid regions, these regions were compared to investigate possible incongruence between them. Lastly, three analytical methods (MP, maximum likelihood (ML) and BI) were used to investigate how these approaches perform with the new data sets.

Materials and methods

Taxon sampling

Total genomic DNA samples used in Forest (2004) were newly sequenced here for 26S rDNA. The National Center for Biotechnology Information (NCBI/GenBank) accession numbers for previously published and newly produced DNA sequences are provided in “Appendix,” including 70 26S rDNA sequences. The taxon sampling list is organized according to the most recent classification system (e.g., Gagnon et al. 2016 and LPWG 2017). We included 34 taxa from seven different orders of Fabidae as outgroup taxa.

DNA extraction, amplification and sequencing

Approximately 950 bp of the 5′-end of the 26S rDNA gene was amplified using primers N-nc26S1 and 950rev (Kuzoff et al. 1998). Amplification was performed using the following program: 2 min at 94 °C; 32 cycles of 45 s at 94 °C, 1 min annealing at 55 °C and 1.5 min at 72 °C; and a final extension of 5 min at 72 °C. When PCR product yields were too low, one of the following additional steps was performed: (1) an increase in the number of cycles (e.g., up to 35 cycles); (2) an additional PCR run of 8 to 10 cycles using the same parameters as above; (3) three identical non-modified reactions pooled together on the same column for the cleaning step. All PCR products were purified with the QIAquick PCR purification kit (Qiagen Inc.) and eluted in EB buffer (10 mM Tris). Complementary strands were sequenced on an ABI 377 or ABI 3100 automated sequencer following the manufacturer’s protocols. The same primers were used for amplification and for the cycle sequencing reactions. Seventy previously unpublished 26S rDNA sequences were included (Forest 2004), and 15 were downloaded from GenBank (“Appendix”). A total of 85 samples were included: 43 from Leguminosae, 17 from Polygalaceae, four from Surianaceae and 21 outgroup taxa representing diverse Fabidae orders. Unfortunately, the 26S region could not be amplified for Quillaja.
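As a bookkeeping aid, the amplification program and sample tally above can be written down and checked with a short Python sketch. The step labels "denaturation" and "extension" are assumptions added for readability (the text names only the annealing and final extension steps), the class and function names are hypothetical, and the computed time covers programmed holds only, ignoring ramping between temperatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str       # label for the step (partly assumed, see lead-in)
    temp_c: int     # block temperature in degrees Celsius
    seconds: int    # programmed hold time


INITIAL = Step("initial denaturation", 94, 120)
CYCLE = (
    Step("denaturation", 94, 45),
    Step("annealing", 55, 60),
    Step("extension", 72, 90),
)
FINAL = Step("final extension", 72, 300)


def hold_time_seconds(n_cycles: int = 32) -> int:
    """Total programmed hold time for the PCR program, excluding ramp time."""
    per_cycle = sum(step.seconds for step in CYCLE)
    return INITIAL.seconds + n_cycles * per_cycle + FINAL.seconds


# Sample counts for the 26S data set as stated in the text.
SAMPLES = {"Leguminosae": 43, "Polygalaceae": 17, "Surianaceae": 4, "outgroup": 21}

total_samples = sum(SAMPLES.values())
standard_run = hold_time_seconds()        # 32 cycles
extended_run = hold_time_seconds(35)      # option (1) for low yields
```

With 32 cycles the programmed holds add up to 6,660 s (111 min), and the four sample groups sum to the 85 samples reported.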

Since sequencing results do not clearly indicate the presence of paralogous copies and/or pseudogenes (e.g., no significant double peaks in chromatograms), this has not been investigated further here for the 26S nuclear gene region.

Phylogenetic analyses and model selection

Sequences were assembled and aligned using the Geneious alignment option in Geneious Pro 4.8.4 (Kearse et al. 2012) with the automatic pairwise alignment tool and subsequently edited manually. Equivocal base calls at the beginning and end of assembled complementary strands were trimmed. All indels were scored as missing data. Eight different analyses were performed to explore the results obtained with the newly produced 26S and published sqd1 nuclear partitions separately and in combination with published matK and rbcL sequences (sqd1 alone, 26S alone, 26S + sqd1 combined, matK + rbcL combined, sqd1 + matK combined, 26S + sqd1 + matK combined, sqd1 + matK + rbcL combined, and 26S + sqd1 + matK + rbcL combined); details of each analysis are presented in Table 2. The substitution models for each of the individual genes were estimated using jModelTest 2.1.10 (Guindon and Gascuel 2003; Darriba et al. 2012).
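The eight data sets differ only in which gene partitions are joined. The Python sketch below illustrates one way such combined matrices could be assembled, assuming that a taxon lacking a gene is padded with missing-data symbols; this padding rule, the function `concatenate`, and the toy sequences are illustrative assumptions and do not reproduce the actual alignments or the workflow used in Geneious.

```python
# The eight data sets listed in the text, as ordered lists of partitions.
DATASETS = {
    "sqd1": ["sqd1"],
    "26S": ["26S"],
    "26S+sqd1": ["26S", "sqd1"],
    "matK+rbcL": ["matK", "rbcL"],
    "sqd1+matK": ["sqd1", "matK"],
    "26S+sqd1+matK": ["26S", "sqd1", "matK"],
    "sqd1+matK+rbcL": ["sqd1", "matK", "rbcL"],
    "26S+sqd1+matK+rbcL": ["26S", "sqd1", "matK", "rbcL"],
}


def concatenate(alignments, genes, missing="?"):
    """Join per-gene alignments into one matrix.

    alignments maps gene -> {taxon: aligned sequence}. A taxon absent
    from a gene receives a run of `missing` of that gene's length.
    Returns the matrix and 1-based (gene, start, end) partition ranges.
    """
    taxa = sorted({taxon for gene in genes for taxon in alignments[gene]})
    matrix = {taxon: "" for taxon in taxa}
    partitions = []
    start = 1
    for gene in genes:
        lengths = {len(seq) for seq in alignments[gene].values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene} is not aligned: lengths {sorted(lengths)}")
        length = lengths.pop()
        for taxon in taxa:
            matrix[taxon] += alignments[gene].get(taxon, missing * length)
        partitions.append((gene, start, start + length - 1))
        start += length
    return matrix, partitions


# Toy, made-up sequences; Quillaja has no 26S sequence, as in the text.
toy = {
    "26S": {"Polygala": "ACGT-A", "Suriana": "ACGTTA"},
    "matK": {"Polygala": "GGC", "Suriana": "GGT", "Quillaja": "GAT"},
}
matrix, partitions = concatenate(toy, ["26S", "matK"])
```

The returned partition ranges are the kind of information a partitioned ML or BI analysis needs in order to apply a model to each gene separately.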

Table 2 Eight phylogenetic analyses of order Fabales performed with different data sets

Maximum parsimony analysis was performed using PAUPRat (parsimony ratchet searches using PAUP*; Sikes and Lewis 2001) as implemented on the CIPRES portal (Miller et al. 2010; https://www.phylo.org/). Heuristic searches were performed with 1,000 replicates with tree-bisection-reconnection (TBR) branch swapping and a maximum of 1,000 best trees kept. All characters were equally weighted and unordered. Strict consensus trees were generated in PAUP* from all the best trees found.

Maximum likelihood analysis was performed using RAxML version 8 (Stamatakis 2014) as implemented on the CIPRES portal (Miller et al. 2010; https://www.phylo.org/). The GTRGAMMA model was applied to each partition individually, and default maximum likelihood search options were selected with 1,000 bootstrap replicates. The best scoring trees with bootstrap values were saved.

Bayesian analyses were conducted using MrBayes 3.2.7a (Ronquist et al. 2012) as implemented on the CIPRES portal (Miller et al. 2010; https://www.phylo.org/). The same GTR + G + I model of molecular evolution as for ML was applied. MrBayes was run with four Markov chain Monte Carlo (MCMC) chains (one cold and three heated) for 100 million generations, sampling one tree every 1,000 generations. This was repeated twice as independent runs, and the resulting parameter files were jointly visualized in Tracer (Rambaut and Drummond 2003) to ensure convergence. Among the 100,000 trees thus obtained, the first 25,000 trees (25%) were discarded as “burn-in”, and a maximum credibility tree and associated posterior probabilities were compiled using the remaining 75,000 trees and the “halfcompat” option of the “sumt” command. Images of the phylogenetic trees were produced using the Interactive Tree of Life (iTOL) online tool (https://itol.embl.de/) (Letunic and Bork 2016).
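The tree counts quoted above follow directly from the chain length, sampling interval and burn-in fraction. A minimal Python sketch of that arithmetic is given here; the function name is hypothetical, and whether the starting state of the chain is counted as an extra sample is deliberately ignored.

```python
def mcmc_sample_counts(generations: int, sample_every: int, burnin_fraction: float):
    """Return (sampled, discarded, kept) tree counts for one MCMC run."""
    sampled = generations // sample_every
    discarded = int(sampled * burnin_fraction)
    return sampled, discarded, sampled - discarded


# Settings stated in the text: 100 million generations, one tree per
# 1,000 generations, first 25% discarded as burn-in.
sampled, discarded, kept = mcmc_sample_counts(100_000_000, 1_000, 0.25)
```

This gives 100,000 sampled trees, 25,000 discarded and 75,000 retained, matching the figures reported for the analysis.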

Alternative topology testing

The approximately unbiased (AU) test (Shimodaira and Hasegawa 1999) was used to evaluate the alternative phylogenetic relationships of the four Fabales families. For each alternative topology, P values were calculated with W-IQ-TREE (https://iqtree.cibiv.univie.ac.at/; Trifinopoulos et al. 2016) using 10,000 bootstrap replicates and our 26S + sqd1 + matK + rbcL combined alignment.
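For context on the size of the hypothesis space for four families, the number of possible bifurcating topologies can be computed from the standard double-factorial formulas, (2n - 5)!! for unrooted and (2n - 3)!! for rooted trees with n tips. The Python sketch below is only this general count; it does not indicate which candidate topologies were submitted to the AU test, and the function names are hypothetical.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2; equals 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_topologies(n_tips: int) -> int:
    """Number of unrooted bifurcating topologies: (2n - 5)!! for n >= 3."""
    if n_tips < 3:
        raise ValueError("need at least three tips")
    return double_factorial(2 * n_tips - 5)


def n_rooted_topologies(n_tips: int) -> int:
    """Number of rooted bifurcating topologies: (2n - 3)!! for n >= 2."""
    if n_tips < 2:
        raise ValueError("need at least two tips")
    return double_factorial(2 * n_tips - 3)


unrooted_four = n_unrooted_topologies(4)
rooted_four = n_rooted_topologies(4)
```

For four tips this yields 3 unrooted and 15 rooted arrangements, so any tested set of interfamilial hypotheses is drawn from a small, fully enumerable space.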

Results

The GTR + G + I model of molecular evolution was selected as the most suitable for each of the individual genes. In the following sections, the results of the ML and BI analyses are highlighted with MP topology summaries presented in Table 3 alongside those obtained from the ML and BI analyses. Only bootstrap support values above 50% or posterior probabilities above 0.95 are discussed. Alignment details for all datasets are also summarized in Table 2 (Online resource 1–8).

Table 3 Summary of phylogenetic trees from nuclear sqd1, sqd1 + matK combined, nuclear 26S, nuclear 26S + sqd1 combined, plastid matK + rbcL combined, 26S + sqd1 + matK combined, sqd1 + matK + rbcL combined and 26S + sqd1 + matK + rbcL combined analyses

Fabales is found to be monophyletic in all analyses based on sqd1 (MP, ML and BI), but interfamilial relationships other than the Leguminosae-Polygalaceae pair were not resolved (Table 3, Online resource 9). Polygalaceae is monophyletic in all analyses, and Xanthophyllum sp. is retrieved as sister to the remainder of the family. Within the monophyletic Leguminosae, all six newly recognized subfamilies are also monophyletic, except in the MP analyses, in which subfamily Papilionoideae is paraphyletic. For the analyses performed with the 26S rDNA region alone (Online resource 10), both Fabales and its constituent families were resolved as monophyletic in the ML analysis (with only 57% bootstrap support) and the BI analysis (posterior probability of 1.0), but not in the MP analysis. However, the position of both Detarium (a member of subfamily Detarioideae) and Acrocarpus (a member of subfamily Caesalpinioideae) within Caesalpinioideae and Papilionoideae, respectively, was never seen in any previous analyses (e.g., LPWG 2017).

In the nuclear 26S + sqd1 ML analysis (Online resource 11), all subfamilies except Caesalpinioideae and Detarioideae were monophyletic. In the plastid matK + rbcL ML analysis, however, the phylogenetic relationships of the six subfamilies support the new classification of the LPWG (2017), and all the subfamilies were monophyletic (Online resource 12). In both analyses (matK + rbcL and 26S + sqd1), Leguminosae was sister to Polygalaceae (with only 60% bootstrap support in the plastid analysis compared to 68% in the nuclear analysis). Quillajaceae was sister to Surianaceae with 85% bootstrap support in the plastid ML analysis, while in the nuclear tree the position of these two families was not resolved. Lastly, in contrast to the highly supported monophyletic Fabales (100%) in the plastid tree, the monophyly of the order in the nuclear tree received only 71% bootstrap support.

The 26S + sqd1 + matK + rbcL ML analysis yielded monophyletic Fabales (100%), Fabales families, Leguminosae subfamilies and Polygalaceae tribes (Fig. 1). A (L + P)(Q + S) topology was observed with moderate bootstrap support (90% for (L + P) and 88% for (Q + S)). Within Leguminosae, all six subfamilies were monophyletic. Within monophyletic Polygalaceae (100%), Xanthophylleae was sister to the remainder of the family.

Fig. 1 Maximum likelihood tree of 26S + sqd1 + matK + rbcL analysis. Outgroup taxa, Polygalaceae, Surianaceae, Quillajaceae and Leguminosae with six subfamilies (Cercidoideae, Detarioideae, Duparquetioideae, Dialioideae, Caesalpinioideae and Papilionoideae) are indicated. Bootstrap values are indicated below branches

The addition of 26S rDNA data to the other data sets did not yield higher support or better resolution (Tables 3 and 4). In contrast to 83% bootstrap support for the (L + P) clade in the sqd1 ML tree, this clade received 68% bootstrap support in the sqd1 + 26S ML analysis. Similarly, the addition of 26S nuclear data to the sqd1 + matK and sqd1 + matK + rbcL data sets did not yield better results. When matK was added, generally higher support values were obtained for all analyses; when rbcL was added, however, slightly lower values were observed (Tables 3 and 4).

Table 4 Comparison of analyses with 26S molecular data included/excluded, with/without matK and with/without rbcL. Results of only the ML analyses are shown

Lastly, our approximately unbiased (AU) test showed that the ((L + P)(S + Q)) topology (1) was not significantly better than the other hypotheses (Table 5).

Table 5 Topology test for the phylogenetic relationships of the four Fabales families

Discussion

Our results have shown that, while the sqd1 nuclear region may not be helpful in solving Fabales phylogeny problems on its own due to reduced support for interfamilial relationships, it can be used in combination with other regions such as matK. On the other hand, there was no difference with regard to phylogenetic relationships between analyses including 26S and those excluding it. While our sequencing results do not clearly indicate the presence of paralogous copies and/or pseudogenes (please note that this has not been investigated in depth here with additional analyses), it is possible that our 26S dataset includes paralogous copies and/or pseudogenes which are causing Caesalpinioideae and Papilionoideae to be represented as non-monophyletic. Indeed, similar results (e.g., non-monophyletic Fabales, Leguminosae and Polygalaceae) were reported by a recent angiosperm-wide study using both 26S and 18S nuclear regions (Maia et al. 2014). Furthermore, the lack of support across the majority of nodes in the 26S tree, especially for Leguminosae, is another concern (Online resource 11), which could be linked to the conserved nature of the region (Kuzoff et al. 1998). Therefore, any phylogenetic study including 26S should assess possible paralogy problems, as well as how the region contributes to support and topology compared with analyses excluding it.

Our results have shown that both the topology and the root of the order change according to the choice of genes and analytical methods (Table 3), which was also common in the previous studies that focussed on Fabales. Moreover, two possible topologies were recovered from our analyses: (L + P)(Q + S), obtained for most analyses, and (((L + P)S)Q), obtained for MP analyses of 26S + sqd1 (Table 3). Overall, our results indicate that the ((L + P)(S + Q)) topology is the most likely; this is the same topology that was recovered from the BI analyses of matK and matK + rbcL by Bello et al. (2009) and again from the BI analyses of matrices A and C of Bello et al. (2012) (Table 1). However, similar to the previous studies (e.g., Forest 2004; Bello et al. 2009, 2012), it was found that both ML and BI analyses yielded low support values for the interfamilial relationships within Fabales. Furthermore, none of the seven different topologies were rejected by the AU test of our combined data, and the first three topologies were not significantly different from each other (Table 5). Indeed, this may indicate that the phylogenetic signal in the internal branches of Fabales is so weak that it is sensitive to any small changes, which is a common feature of rapid radiations (Rota-Stabelli and Telford 2008; Roberts et al. 2009). However, Fabales is not one of the hard polytomy cases reported to date (Bello et al. 2009), in which the genes that are used may not have any phylogenetic signal for the internal branches (Braby et al. 2005; Whitfield and Kjer 2008; Kodandaramaiah et al. 2010).

Lack of resolution is a common problem across angiosperms in general (e.g., Zeng et al. 2014; Huang et al. 2015; LPWG 2017), and there are several common reasons underlying not only unresolved rapid radiations but most phylogenetic problems, such as gene tree incongruence due to biological events (e.g., whole genome duplication (WGD), hybridization, introgression, horizontal gene transfer, incomplete lineage sorting (ILS), extinction) (e.g., Koenen et al. 2019), outgroup problems (i.e., lack of an extant outgroup/closely related outgroup or the effect of the outgroup on ingroup topology) (e.g., Huerta-Cepas et al. 2014), or simply systematic errors related to taxon sampling (Thomas et al. 2013), outgroup choice (i.e., possible systematic biases related to the outgroup sequences, such as a low substitution rate and a G + C composition unlike that of the ingroup) (e.g., Rota-Stabelli and Telford 2008), LBA (e.g., Qui et al. 2001), inadequate data and inaccurate model implementation (e.g., Reddy et al. 2017; Morgan et al. 2013).

A recent study has shown that the root of Leguminosae is particularly difficult to resolve, due to several WGD events, a combination of short internal and long external branches (i.e., extinction and rapid divergence, respectively), ILS and/or reticulation (Koenen et al. 2019; see also Cannon et al. 2015 and Wong et al. 2017). Furthermore, it was also argued that obtaining a fully bifurcated legume tree may not be possible due to the simultaneous/near-simultaneous origin of the family (Koenen et al. 2019). Indeed, conflict is very widespread, and it is quite possible that every gene tree is incongruent with the species tree, with these incongruences being stronger for the short internal nodes (Salichos et al. 2014). The same evolutionary history would also be possible for the order Fabales, and even thousands of genes may not be enough to solve the Fabales phylogeny, similar to the case of Leguminosae. On the other hand, we think that LBA may not be a problem for Fabales, because in the presence of LBA the root of the group is not stable when sampling different outgroups (Qui et al. 2001), which is not the case for Fabales (e.g., Bello et al. 2009, 2012; current study). Furthermore, to overcome a possible LBA problem, we employed a broad outgroup sampling strategy (Smith 1994; Lyons-Weiler et al. 1998; Djernaes et al. 2012; Drew et al. 2014) and performed Bayesian analyses, which are less vulnerable to LBA artefacts than parsimony analyses (Bergsten 2005); yet both the root and topology of the tree changed according to the phylogenetic method and genes used. However, the effects of data sampling, model implementation, outgroup choice and taxon sampling need further analyses, and future studies should focus on these possible causes for the unresolved Fabales phylogeny.

In conclusion, as with previous studies, this study did not find well-supported dichotomous relationships among the four Fabales families, which may indicate a rapid, near-simultaneous evolution of the four families. Therefore, it should not be concluded that ((L + P)(Q + S)) is the “definitive answer” for relationships within Fabales, as further studies are still needed not only to confirm whether ((L + P)(Q + S)) or another topology is the right answer for the order, but also to reveal the underlying reason for the unresolved phylogeny within Fabales. However, we think that this and previous studies dealing with interfamilial Fabales relationships will provide the framework for future genomic studies that address the issue. Further work is certainly needed to solve the Fabales puzzle with confidence, and to approach the underlying problem from a direction other than conventional phylogenetic methods.