Introduction

Phylogenetic analyses based on molecular data can be misled by a variety of pitfalls such as model misspecification (Posada and Buckley 2004), long branch attraction (Felsenstein 1978), or heterogeneous base composition (Lockhart et al. 1994; Mooers and Holmes 2000; Jermiin et al. 2004). Heterogeneous base composition may suggest relatedness of lineages that share similar nucleotide frequencies by chance and not by common descent. Compositional heterogeneity has been reported at different levels of phylogenetic divergence and may affect not only nucleotides but also amino acids (Foster and Hickey 1999; Singh et al. 2009; Nesnidal et al. 2010). There is controversy over the severity of the effects of divergent nucleotide or amino acid frequencies on the accuracy of phylogenetic reconstruction (Rosenberg and Kumar 2003; Jermiin et al. 2004). Nevertheless, as non-stationary data violate the assumptions of standard reconstruction methods, a number of approaches have been developed to account for this issue, including LogDet distances and specific substitution models for maximum likelihood analyses (Galtier and Gouy 1995; Boussau and Gouy 2006; Dutheil and Boussau 2008). In addition, RY-coding has been suggested as a remedy for heterogeneous nucleotide frequencies (Phillips and Penny 2003).

In our ongoing phylogenetic and biogeographic analyses of the land snail genus Theba, which naturally occurs in NW Africa, the Canary and Selvagem Islands, as well as on the Iberian Peninsula (Gittenberger and Ripken 1987; Greve et al. 2010; Däumer et al. 2012), we have encountered a number of problems and contradictory results. According to our initial analysis based on fragments of the mitochondrial cytochrome oxidase subunit I (COI) and the internal transcribed spacer 1 of the nuclear ribosomal RNA complex (ITS1), the genus evolved on the Canary Islands and back-colonized the continents. The phylogenetic signal was dominated by COI; however, the third codon positions were inhomogeneous, which had to be corrected by RY-coding (Greve et al. 2010), possibly at the cost of information (Sauer and Hausdorf 2010). Subsequently, we analyzed amplified fragment length polymorphisms (AFLPs) and considerably more specimens, which turned the topology upside-down, suggesting the origin of the genus in NW Africa and dispersal to the Canary and Selvagem Islands as well as the Iberian Peninsula (Haase et al. 2014). In the same paper, we conducted an analysis based solely on COI and the same set of specimens. Again, homogeneity of the third codon positions had to be established by RY-coding, and the resulting topology was similar to the AFLP topology, albeit with a different continental clade as sister group to the remaining clades. In contrast to the AFLP tree, the basal nodes were extremely poorly supported.

In general, mito-nuclear discordance is commonly encountered in phylogenetic analyses and mostly attributed to incomplete lineage sorting, introgression, or unresolved taxonomy (e.g., Avise 1994; Funk and Omland 2003). Alternatively, factors including selection or sex-related asymmetries such as female-biased dispersal are considered (Toews and Brelsford 2012). However, systematic biases in sequence evolution are rarely questioned in this context.

In the present paper, we asked whether the topological ambiguities were due to (1) lack of resolution of COI and/or (2) the heterogeneity of base composition. In order to potentially increase mitochondrial information and resolution, we sequenced a fragment of 16S rRNA. In many phylogenetic analyses on comparable taxonomic levels, 16S rRNA has proved to evolve more conservatively than COI did and thus to provide more information on deeper levels (e.g., Fiorentino et al. 2010; Zielske et al. 2011; Johnson et al. 2012; Palsson et al. 2014). To control for the effects of inhomogeneous base frequencies, we conducted LogDet-distance, maximum parsimony (MP), and maximum likelihood (ML) analyses as well as Bayesian inference (BI), the latter three based on both the original data as well as on RY-coded data. The Bayesian approaches also included analyses allowing for heterogeneous evolutionary rates among lineages (Drummond et al. 2006). With the exception of LogDet, we conducted our analyses based on optimally partitioned data in order to retrieve the maximum information (Phillips and Penny 2003; Lanfear et al. 2012). ML analyses implementing models that take compositional heterogeneity into account were not feasible because of the size of the data set and/or their restriction to unpartitioned alignments (Galtier and Gouy 1998; Boussau and Gouy 2006; Dutheil and Boussau 2008). In a second approach, we tested whether conventional Bayesian analyses would reconstruct the original topology of Theba from inhomogeneous data simulated based on this original topology.

Material and methods

Material and DNA sequencing

Our analyses included 172 of the 182 specimens of Theba analyzed by Haase et al. (2014) (Table 1). We used existing COI sequences and newly sequenced a fragment of 16S rRNA (see below) from the stored DNA extracts; sequencing failed for ten individuals. The outgroup comprised Cochlicella acuta (O. F. Müller 1774; Geomitridae), Cornu aspersum (O. F. Müller 1774; Helicidae), Drusia deshayesii (Moquin-Tandon 1848; Parmacellidae; formerly in Parmacella, see Martínez-Ortí and Borredà 2013), Obelus despreauxii (D’Orbigny 1839; Geomitridae), and Trochoidea pyramidata (Draparnaud 1805; Geomitridae). For the latest suprageneric classification adopted here, see Razkin et al. (2015). The 16S rRNA fragment was amplified using the primers 16Scs1 and 16Sma2 developed by Chiba (1999). Polymerase chain reactions (PCRs) were performed in a total volume of 11 μl containing 1 μl 10× BH4 reaction buffer (BIOLINE GmbH, Luckenwalde, Germany), 4.4 mM of MgCl2, 0.3 pM of each primer, 0.2 mM of dNTP, 0.4 μl of BSA (1 %), 0.2 U of DNA-polymerase (BIOLINE), 50 ng DNA, and double-distilled water. The PCR profile comprised an initial denaturation at 95 °C for 3 min, 35 cycles including denaturation at 95 °C for 30 s, annealing at 50 °C for 30 s, and elongation at 72 °C for 1 min, and a final extension at 72 °C for 7 min. PCR products were cleaned using Exonuclease I (New England Biolabs GmbH, Frankfurt/Main, Germany) and Shrimp-Alkaline-Phosphatase (Promega, Madison, WI, USA). Cycle sequencing was performed using the Big Dye Terminator Ready Reaction Mix v3.1 (Applied Biosystems, Carlsbad, CA, USA) and the PCR primers. After cleaning with CleanSEQ (Beckman Coulter, Beverly, MA, USA), sequences were read in both directions on an ABI 3130xl Genetic Analyzer.

Table 1 Material sequenced

Sequence editing and alignment

Sequences were edited in DNA Baser Sequence Assembler 4.16 (Heracle BioSoft SRL) and initially aligned together with a structure annotated sequence of Albinaria turrita using CLUSTAL W (Thompson et al. 1994). The sequence of A. turrita was originally retrieved from the European Ribosomal Database (de Rijk et al. 2000; van de Peer et al. 2000), which is no longer maintained. The secondary structure of A. turrita served as seed for a structure-informed alignment made in RNAsalsa 0.8.1 (Stocsits et al. 2009). This was then trimmed to 856 base pairs (bp) in BioEdit 7.2.5 (Hall 1999) and concatenated with the 630 bp alignment of the COI fragment (Haase et al. 2014). Aliscore 2.0 (Misof and Misof 2009; Kück et al. 2010) did not detect random similarity; therefore, no masking was necessary. We then defined five partitions: stems and loops of 16S rRNA and the three codon positions of COI (see below). These were separately tested for homogeneity of base frequencies excluding constant sites as proxies for invariant sites (Lockhart et al. 1996) using the χ² test implemented in PAUP* 4b10 (Swofford 2003). Loops and third codon positions turned out to have heterogeneous base composition (χ² = 864.44, df = 528, P < 0.001; χ² = 1707.56, df = 528, P < 0.001). Saturation of substitutions was tested for each partition in DAMBE 5.3.105 (Xia 2013) based only on fully resolved sites as recommended by the program. Saturation may have been problematic only for the third codon positions of COI and then only if the underlying tree was considered unsymmetrical. However, DAMBE simulates saturation indices for Xia et al.’s (2003) test only for up to 32 taxa. Therefore, for considerably larger datasets such as ours, interpretation remains somewhat ambiguous in general. As Aliscore did not detect noisy positions, we considered that lack of phylogenetic signal was not an issue in our data.
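The homogeneity test described above can be illustrated with a minimal Python sketch. It computes a contingency-table χ² statistic over a taxa × states table of base counts and is not the PAUP* implementation: the function name `composition_chi2` and the toy sequences are assumptions made for illustration, and the removal of constant sites is not reproduced here.

```python
from collections import Counter

def composition_chi2(seqs, states="ACGT"):
    """Chi-square statistic and degrees of freedom for homogeneity of
    state frequencies across taxa (taxa x states contingency table).

    seqs: dict mapping taxon label -> aligned sequence string.
    Characters outside `states` (gaps, ambiguity codes) are ignored.
    """
    counts = [Counter(c for c in s.upper() if c in states) for s in seqs.values()]
    row_totals = [sum(c[b] for b in states) for c in counts]
    col_totals = {b: sum(c[b] for c in counts) for b in states}
    total = sum(row_totals)

    chi2 = 0.0
    for c, n in zip(counts, row_totals):
        for b in states:
            expected = n * col_totals[b] / total
            if expected > 0:
                chi2 += (c[b] - expected) ** 2 / expected

    # (number of taxa - 1) * (number of states - 1)
    df = (len(seqs) - 1) * (len(states) - 1)
    return chi2, df

# Toy example (hypothetical data): two taxa with opposite composition.
print(composition_chi2({"a": "AAAA", "b": "CCCC"}, states="AC"))  # (8.0, 1)
```

With 177 sequences (172 Theba specimens plus five outgroup taxa) and four nucleotide states, the degrees of freedom are 176 × 3 = 528, which agrees with the value reported above. A P value would additionally require the χ² survival function, for example `scipy.stats.chi2.sf(chi2, df)`.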

Phylogenetic analyses of empirical data

We conducted analyses (1) ignoring heterogeneity of base composition, (2) accounting for heterogeneity of base composition, and (3) accounting for heterogeneity of substitution rates. The first group of analyses comprised MP, ML, and BI. MP was conducted in PAUP* 4b10 with 500 replicates, stepwise addition, and random starting trees. We applied TBR branch swapping and restricted each replicate to 1 million rearrangements. Robustness was assessed by 1000 bootstrap replicates. For ML, we used Garli 2.0 (Zwickl 2006) running 500 replicates for both finding the optimal trees and bootstrapping. BI was performed in MrBayes 3.2.2 (Ronquist et al. 2012) over 8 million generations, saving every 100th tree with a burnin of 25 %. To account for heterogeneity of base frequencies, we constructed a BioNeighbor-joining tree based on LogDet distances (Lockhart et al. 1994) in PAUP* 4b10, removing invariant sites in proportion to frequencies estimated from constant sites. The proportion of invariant sites was estimated in jModeltest v2.1.4 (Darriba et al. 2012). In addition, we recoded the heterogeneous partitions (loops and third codon positions) using R for purines and Y for pyrimidines (RY-coding) and repeated MP, ML, and BI analyses. While the RY-coded loops indeed became homogeneous, the third codon positions remained heterogeneous. Finally, we also conducted Bayesian tree reconstructions in BEAST 1.8.0 (Drummond et al. 2012), implementing the log-normal uncorrelated relaxed molecular clock and a birth-death model as tree prior. We jointly summarized four independent analyses of 20 million generations each, with every 1000th tree sampled and a burnin of 10 %. BI was repeated with the RY-coded alignment as well. Convergence of parameter estimates in both types of Bayesian analyses was checked by ensuring that effective sample sizes were larger than 200 as indicated in Tracer 1.6 (Rambaut et al. 2014) and based on the criteria implemented in the respective programs.
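As a rough illustration of the LogDet distances mentioned above, the following Python sketch computes a pairwise distance from the 4 × 4 joint frequency matrix of two aligned sequences. It uses one commonly cited normalised form, d = -(1/4)[ln det F - (1/2) ln(∏ f_x ∏ f_y)], and is an assumption-laden sketch rather than the PAUP* routine: the removal of invariant sites is not reproduced, and the function names are hypothetical.

```python
import math

STATES = "ACGT"

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-15:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

def logdet_distance(x, y):
    """LogDet distance between two aligned sequences (normalised form)."""
    # Use only columns where both sequences carry a fully resolved base.
    pairs = [(a, b) for a, b in zip(x.upper(), y.upper())
             if a in STATES and b in STATES]
    n = len(pairs)
    # Joint frequency matrix F: F[i][j] = share of sites with state i in x, j in y.
    F = [[0.0] * 4 for _ in range(4)]
    for a, b in pairs:
        F[STATES.index(a)][STATES.index(b)] += 1.0 / n
    fx = [sum(row) for row in F]                               # base frequencies of x
    fy = [sum(F[i][j] for i in range(4)) for j in range(4)]   # base frequencies of y
    d = det(F)
    if d <= 0 or min(fx) == 0 or min(fy) == 0:
        raise ValueError("LogDet distance undefined for this pair")
    return -0.25 * (math.log(d) - 0.5 * math.log(math.prod(fx) * math.prod(fy)))

# Toy example (hypothetical sequences differing at 4 of 16 sites).
print(round(logdet_distance("AAAACCCCGGGGTTTT", "AAACCCCGGGGTTTTA"), 4))  # 0.2908
```

Because the distance depends on the determinant of the joint frequency matrix rather than on a single stationary base composition, it remains additive when base frequencies differ between the two sequences, which is the property exploited here.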
Prior to the analyses based on substitution models, Partition Finder 1.1.0 (Lanfear et al. 2012), comparing all possible combinations of up to five partitions, confirmed the above partitioning scheme as optimal and selected appropriate models based on the Bayesian information criterion (Table 2).
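The RY-coding applied to the loop and third codon position partitions can be sketched in Python as follows. This is a minimal illustration, not the procedure used to prepare the alignments; the function names, the column indices, and the assumption that a coding region starts in frame at a known column are hypothetical.

```python
# Purines (A, G) become R; pyrimidines (C, T, and U for RNA) become Y.
RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}

def ry_code(seq, columns=None):
    """Recode a sequence to R/Y at the given 0-based columns (None = all).

    Gaps and ambiguity codes are left unchanged. The sequence is
    upper-cased throughout, including columns that are not recoded.
    """
    cols = None if columns is None else set(columns)
    out = []
    for i, ch in enumerate(seq.upper()):
        if cols is None or i in cols:
            out.append(RY_MAP.get(ch, ch))
        else:
            out.append(ch)
    return "".join(out)

def third_codon_positions(start, length):
    """0-based columns of third codon positions for an in-frame coding
    region beginning at column `start` and spanning `length` columns."""
    return range(start + 2, start + length, 3)

# Toy example (hypothetical 9-bp coding fragment starting at column 0).
print(ry_code("ATGGCCAAT", third_codon_positions(0, 9)))  # ATRGCYAAY
```

Recoding reduces four states to two, so the degrees of freedom of a base composition test drop from 3 to 1 per additional taxon, and transitions within purines or within pyrimidines become invisible. That loss is the information cost of RY-coding.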

Table 2 Best fitting substitution models implemented in maximum likelihood (ML) and two types of Bayesian analyses (MrBayes, BEAST)

Phylogenetic analyses of simulated data

In order to test whether inhomogeneous base composition may have influenced the topology of the mitochondrial tree of Theba, we simulated 100 alignments with five partitions of the original length based on the original topology and reconstructed the trees using MrBayes. The alignments were simulated in Indelible 1.03 (Fletcher and Yang 2009), which allows the simulation of sequences under non-stationary conditions. To save computation time, the simulations were based on a reduced taxon set comprising five individuals per clade and the outgroup Cornu aspersum. The backbone tree was constructed in an ML framework with Garli after model fitting with jModeltest (Table 1). As the base of the tree was unresolved, we introduced a branch with a length of 0.15 separating outgroup from ingroup. The remaining topology was fully resolved. For the partitions corresponding to those with heterogeneous base frequencies in the original data, partitions 2 (loops) and 5 (third codon positions), we fitted separate substitution models to the four main clades and all older branches based on the results of jModeltest for the original data (see configuration file in Appendix 1). The models used for the five partitions in reconstructions with MrBayes are again listed in Table 2. Every 100th tree of a total of 1 million generations was sampled with a burnin of 25 %. Convergence of parameter estimates was monitored as stated above.

Results

In our presentation of the results, we focus on the inter-relationships of the four main clades. Relationships within these clades are not considered. After RY-coding, only the base composition of the loops was no longer heterogeneous, in contrast to the third codon positions (χ² = 134.33, df = 176, P = 0.999; χ² = 293.15, df = 176, P < 0.001). Figure 1 shows the LogDet tree with collapsed main clades. A Bayesian analysis with unmanipulated clades is given in Supplement 1. Clade 1 consisted of snails from the Selvagem Islands and Lanzarote, clade 2 was composed of sequences exclusively from the Canary Islands, clade 3 contained mainly samples from NW Africa, and clade 4 snails from NW Africa as well as Europe. The reconstructions based on original data and RY-coded data, respectively, are summarized in Fig. 2. All tree reconstructions gave very similar results, with an ingroup significantly supported only by the LogDet and BEAST analyses. The four main clades were, however, largely well supported, and most methods revealed clade 1 as a robust sister group to the remaining three clades. Only the two ML analyses showed a polytomy instead of nodes 2 and 3. Except for the LogDet and BEAST analyses, all approaches reconstructed clades 2 and 4 as sister groups; however, only MrBayes recovered this with significant support. In the BEAST analysis, node 3 was a polytomy, and only the LogDet and the RY-coded BEAST analyses reconstructed clades 2 and 3 as sister taxa, albeit with negligible support. Based on RY-coded sequences, MrBayes also recovered node 3 as a polytomy, which also included parts of clade 3. In general, the approaches supposed to mitigate the effects of heterogeneous base composition did not influence the gross topology, i.e., the relationships of the main clades. RY-coding largely resulted in weaker resolution.

Fig. 1

BioNeighbor-joining tree based on LogDet distances with bootstrap support values. Ingroup clades collapsed. For labels of outgroup taxa see Table 1. Expanded clades are shown in Supplement 1. Scale bar = substitutions per site

Fig. 2

Fifty percent majority rule consensus tree composed of maximum parsimony (MP), maximum likelihood (ML), as well as Bayesian trees reconstructed in MrBayes and BEAST, respectively. All analyses were conducted with original and RY-coded data. Support values from these analyses are indicated in order below tree (bootstrap values or posterior probabilities). Four outgroup taxa were pruned from tree. Nodes were numbered for reference to the text

The Bayesian reconstructions based on 100 simulated data sets were highly concordant. In at least 95 cases, the scaffold topology was recovered, with the exception of the root node and one node within clade 2 (Figs. 3 and 4). This suggests that the phylogenetic signal largely remained unambiguous despite introducing heterogeneity of base composition in two partitions.
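The recovery counts reported above correspond to the frequency with which each clade appears among the 100 summary trees. A minimal Python sketch of that tally follows; it assumes, purely for illustration, that each tree has already been reduced to a collection of clades (sets of taxon labels), and it does not parse tree files. The function names and the toy taxa are hypothetical.

```python
from collections import Counter

def clade_frequencies(trees):
    """Fraction of trees in which each clade occurs.

    trees: iterable of trees, each given as an iterable of clades,
    where a clade is an iterable of taxon labels.
    """
    clade_sets = [{frozenset(clade) for clade in tree} for tree in trees]
    counts = Counter(clade for tree in clade_sets for clade in tree)
    n = len(clade_sets)
    return {clade: k / n for clade, k in counts.items()}

def majority_rule(freqs, threshold=0.5):
    """Clades occurring in more than `threshold` of the trees."""
    return {clade: f for clade, f in freqs.items() if f > threshold}

# Toy example (hypothetical): three replicate trees on taxa A, B, C.
trees = [
    [("A", "B"), ("A", "B", "C")],
    [("A", "B"), ("A", "B", "C")],
    [("B", "C"), ("A", "B", "C")],
]
freqs = clade_frequencies(trees)
print(sorted(sorted(c) for c in majority_rule(freqs)))  # [['A', 'B'], ['A', 'B', 'C']]
```

Under this representation, a scaffold node recovered in at least 95 of 100 replicates would have a frequency of 0.95 or higher.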

Fig. 3

Model tree for sequence simulations generated in a ML framework and based on a reduced taxon set (see Table 1 for taxon labels)

Fig. 4

Fifty percent majority rule consensus tree with consensus indices based on 100 summary trees from analyses of simulated data in MrBayes. For labels of taxa see Table 1

Discussion

Just like the protein-coding COI, the newly sequenced 16S rRNA gene exhibited segments with homogeneous as well as heterogeneous base composition. It appears that in Theba, sections of mitochondrial DNA subject to stronger constraints, such as the first and second codon positions or stems, evolve rather conservatively with regard to base frequencies, while selectively more neutral sections such as third codon positions and loops show higher variation in substitution patterns. Thus, by generating more data, we even increased the proportion of sites with inhomogeneous base composition, as the loop sections comprised 75 % of the entire 16S rRNA fragment.

However, the general picture of tree reconstruction was the same for the standard approaches as well as those taking inhomogeneous base frequencies into account. Augmenting the mitochondrial data by 16S rRNA slightly increased the support for the deeper nodes of the Theba phylogeny compared to our foregoing analyses (Greve et al. 2010; Haase et al. 2014). However, in accordance with Greve et al. (2010), the topology still suggested an origin of the genus on the Selvagem and Canary Islands with subsequent colonization of the continents, in contrast to the AFLP data. In addition, some relationships among the main clades remained ambiguous. The poorly supported topology of the COI tree in Haase et al. (2014), with the Moroccan-Mediterranean clade corresponding to our clade 4 as sister group to the remaining clades, may have been due to over-parameterization in RAxML (Stamatakis 2006) implementing GTR as substitution model. We repeated the analysis of the foregoing paper in a different version of RAxML that also offers HKY85 and K80, which have fewer parameters. While the HKY85 topology corresponded well to the one based on GTR, the topology based on the simplest model, K80, was indeed very similar to the one reported by Greve et al. (2010) (data not shown). Well into the age of phylogenomics, it is now generally accepted that increasing the number of sites increases the accuracy of phylogenetic reconstructions. This has also been observed in studies investigating the effects of heterogeneous base composition (Rosenberg and Kumar 2003; Jermiin et al. 2004; Betancur-R et al. 2013), suggesting that adding a second sequence reduced the ambiguity in the phylogenetic signal of COI and resulted in a more accurate and robust reconstruction.

Comparing trees based on RY-coding with those reconstructed from the original data, we observed both the desired effect, i.e., improved resolution with respect to support (Phillips and Penny 2003; Ishikawa et al. 2012), and nodes that received less support. The latter was probably due to loss of information (Sauer and Hausdorf 2010).

Our simulations confirmed the reconstructions based on real data. The Bayesian analyses assuming stationary evolutionary processes were not misled by the introduction of inhomogeneous base composition and recovered the original tree topology that was used to simulate sequence evolution. The only exceptions were lack of support for a single node within one of the main clades and for the root node. In conclusion, the phylogenetic signal of mtDNA in the land snail genus Theba appeared to be robust despite considerable inhomogeneity of base composition. A Bayesian analysis of the original data excluding the inhomogeneous partitions was considerably less resolved (not shown), confirming the information content of the excluded data.

The case of Theba is concordant with several other phylogenetic studies that were not affected by heterogeneous base frequencies (Rosenberg and Kumar 2003). Conditions under which compositional heterogeneity becomes a problem have only rarely been investigated. Simulations suggested that extreme changes in base frequencies are necessary to mislead phylogenetic analyses (Van Den Bussche et al. 1998; Conant and Lewis 2001) or that inhomogeneous base frequencies in combination with other confounding effects such as rate heterogeneity among lineages may generate problems (Ho and Jermiin 2004). Jermiin et al. (2004) showed that short internal branches may not be recovered if base composition is not homogeneous across taxa. As the internal branches of our mitochondrial Theba phylogeny had considerable lengths, this may explain why compositional heterogeneity had no detrimental effects.

In general, the effects of non-stationary evolutionary processes on phylogenetic reconstruction still appear to be poorly understood and are probably highly dependent on the actual data. Finally, the incongruence in the phylogenetic signal of mitochondrial and nuclear data in Theba is probably real and most likely a consequence of incomplete lineage sorting (see Toews and Brelsford 2012).