Introduction

Major changes in habitat or lifestyle, such as the transition from marine to freshwater environments, are among the most conspicuous and interesting evolutionary events. Dispersal by animals into freshwater habitats from marine systems is largely prevented by the substantial differences in osmotic pressure and ionic concentration between these habitats (Lee and Bell 1999). Maintaining appropriate concentrations of various ions is critically important to the functioning of living cells. For freshwater organisms this means maintaining a much higher concentration of ions in the cellular fluid than is found in their environment and typically expending more energy on osmotic and ionic regulation compared to marine organisms (Generlich and Giere 1996; Lee and Bell 1999). That concentration of ions creates a continual influx of water, which the organisms must eliminate while conserving ions (Beadle 1957; Evans 2008). As a result, freshwater organisms typically face a much greater challenge in osmotic and ionic regulation than marine organisms do (Péqueux 1995). The interface between marine and freshwater environments thus creates a barrier, the magnitude of which can be surmised by how few major animal groups have managed to overcome it. The geological record indicates that while marine habitats were occupied by ecologically complex communities within 100 million years of the appearance of multicellular animals, freshwater habitats remained mostly uninhabited for another 200 million years (Miller and Labandeira 2002). Only about half of the ~ 35 extant animal phyla have representatives in both marine and freshwater environments (Little 1983, 1990), and even within some of those phyla there are major clades with no freshwater representatives (cephalopods, chitons and scaphopods, for example, are exclusively marine, while Gastropoda and Bivalvia have representatives in both marine and freshwater environments). 
Interestingly, within many clades with freshwater representatives, the invasion of freshwater appears to have occurred repeatedly (Lee and Bell 1999). Understanding the adaptations that make those lineages better able to colonize freshwater environments will provide valuable insight into the evolutionary processes and pressures involved in these transitions.

Six phyla (Arthropoda, Chordata, Mollusca, Annelida, Nematoda, and Rotifera) represent the vast majority of freshwater animal species (Balian et al. 2008). Five of the six phyla are protostomes, of which two (Arthropoda and Nematoda) belong to the clade Ecdysozoa while the other three (Mollusca, Annelida, and Rotifera) are members of the clade Spiralia (Giribet 2008; Marlétaz et al. 2019). Arthropod and chordate species are the basis for most studies of adaptation to fresh water, but these species may not necessarily be good models for how species in other groups have overcome this barrier. Investigating how animals in other clades have adapted to fresh water is important in understanding what types of adaptations may be universal to freshwater animals and which ones are clade specific. The exoskeleton of most aquatic arthropod groups and scaly skin of teleost fish provide a relatively impermeable barrier to water and ions, so much of the ionic stress in those groups is expected to be localized, occurring mostly in the gills as well as the digestive and excretory systems. Spiralian animals may face challenges even beyond those of fish or arthropods when colonizing freshwater habitats. Soft-bodied animals such as annelids and mollusks face similar challenges in their digestive and excretory systems, but also must adapt to potential ion loss and osmotic pressures across their epidermis (Schnizler et al. 2002; Krumm et al. 2005). Many mollusks and annelids secrete mucus, a viscous colloidal suspension comprising glycoproteins that is involved in osmotic and ionic regulation across their body surface (Schnizler et al. 2002; Evans 2008; Creencia and Noro 2018). By focusing on mollusks and annelids, we hope to find convergent adaptations in the freshwater lineages of those groups, which may or may not also be present in arthropods and chordates.

Several families of genes produce proteins that pump ions against concentration gradients. These proteins are involved in creating the ion gradients that many vital cellular functions depend on. Other proteins that help control the movement of ions between cells, such as ion channels, are also integral components of functional cells. Given the importance of these proteins, we might expect there to be little opportunity for directional selection, either in function or regulation, yet some type of change to at least some of these proteins is likely necessary for adaptation to freshwater environments. Gene duplications can provide opportunities for genes normally under strong purifying selection to undergo changes. Gene duplications are predicted to create redundancy, resulting in relaxed selection on the duplicated genes (Ohno 1970). Gene duplications may also provide an increase in the dosage of the protein product (Kondrashov et al. 2002), which may play a role in the increased magnitude of ionic regulation.

There is evidence that expansions of several ATPase gene families were involved in freshwater colonization by annelids (Horn et al. 2019). To investigate the possible role gene family expansions played in the freshwater invasions of other spiralian lineages, we downloaded the amino acid sequences of available genomes for representatives of both marine and freshwater spiralians from public databases (Table 1). We compared the number of gene copies between the marine and freshwater taxa for the gene families identified by Horn et al. (2019). We also identified other gene families that experienced expansions along freshwater lineages and used GO enrichment analysis to identify what molecular functions are overrepresented by those families. We expect to see the same family expansions Horn et al. (2019) identified in annelids repeated in other spiralian freshwater lineages. We also expect to identify several other gene families of interest that can serve as a starting point for continued exploration of spiralian adaptations to freshwater environments.

Table 1 Source of sequence data, assembly information, and reference for each genome assembly, listed along with the phylum and habitat of each species used

Methods

We limited our data sampling to publicly available gene sets from sequenced genomes of free-living spiralian animals. Amino acid sequences from the Notospermus geniculatus and Phoronis australis genomes were downloaded from the Okinawa Institute of Science and Technology Graduate University Marine Genomics Unit (Luo et al. 2018), amino acid sequences from the Schmidtea mediterranea genome were downloaded from WormBase ParaSite (Robb et al. 2008), amino acid sequences from the Biomphalaria glabrata genome were downloaded from VectorBase, and amino acid sequences from genomes for each of the other organisms were downloaded from Ensembl Metazoa, part of the Ensembl Genomes project (Kersey et al. 2018) (Table 1). All of these taxa are marine except for Helobdella robusta (Annelida), Biomphalaria glabrata (Mollusca), Adineta vaga (Rotifera), and Schmidtea mediterranea (Platyhelminthes). We performed analyses of gene family size evolution using CAFE version 4.2 (Han et al. 2013). To infer the species phylogenies required for the CAFE analyses, we first identified a set of core orthologs by using HaMStR version 13.2.6 (Ebersberger et al. 2009) to search each of the 11 genomes using the set of model organism pHMMs. FASTA-formatted files for each orthogroup identified by HaMStR were generated using a custom script. Each orthogroup with at least 6 taxa was aligned using the L-INS-i algorithm in MAFFT (Katoh and Standley 2013). We then identified the best-fitting amino acid substitution model for each aligned orthogroup using the ProteinModelSelection.pl script (https://github.com/stamatak/standard-RAxML/blob/master/usefulScripts/ProteinModelSelection.pl) and a maximum likelihood tree was inferred for each.
For each maximum likelihood tree, we calculated the average pairwise distance between each tip as well as a measure of branch-length heterogeneity (for each tip, the mean pairwise distance to all other tips was compared to the mean of all tip pairwise distances); both calculations are described by Struck (2014). We set our threshold for considering a score an outlier as equal to or greater than 1.5 times the interquartile range above the median, a standard threshold for outliers (Tukey 1977). Any orthogroups with trees that were outliers for either measure were eliminated. The remaining orthogroups were concatenated with FASconCAT v1.0 (Kück and Meusemann 2010). A partitioned analysis, using each orthogroup as a subset and the best-fitting model for each subset, was run in RAxML version 8.2.10 (Stamatakis 2014) with 100 rapid bootstrap iterations. Constraint analyses were also performed using the same data matrix to infer trees constrained to three alternative topologies based on recent publications (Laumer et al. 2015; Kocot et al. 2017; Marlétaz et al. 2019; Fig. 1). CAFE requires ultrametric trees with nonzero integer numbers for branch lengths. Ultrametric branch lengths were estimated for each likelihood tree using the chronos function in the ape package (Paradis and Schliep 2018) in R (R Core Team 2018) and each branch length was multiplied by 100. To determine gene family sizes, the amino acid sequences for each genome were assigned to orthologous groups via the OrthoMCL (Fischer et al. 2011) workflow on the EuPathDB Galaxy server (Aurrecoechea et al. 2017). The number of sequences present in each genome for each orthologous group was determined. Orthologous groups with no sequences for more than one genome or more than one hundred sequences for any genome were not used in our analyses. For each of the four tree topologies, CAFE was run with 100,000 random samples, and we used a search function to optimize the lambda value.
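Two of the filters described above are simple enough to sketch in code. The following Python is an illustration only, not the authors' scripts: the function names and the input layout (a dictionary of per-orthogroup scores, a list of per-genome copy counts) are hypothetical, and the quartile convention used for the interquartile range is an assumption, since the text does not specify one.

```python
from statistics import median, quantiles


def outlier_cutoff(scores):
    """Return median + 1.5 * IQR, the outlier cutoff described in the text.

    Assumption: quartiles use the 'inclusive' convention of
    statistics.quantiles; the paper does not state which one was used.
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores) + 1.5 * (q3 - q1)


def drop_outlier_trees(scores_by_group):
    """Keep orthogroups whose tree score lies below the cutoff.

    scores_by_group is a hypothetical {orthogroup_id: score} mapping for
    one measure (average pairwise distance or branch-length heterogeneity).
    Scores equal to or above the cutoff are treated as outliers.
    """
    cutoff = outlier_cutoff(list(scores_by_group.values()))
    return {og: s for og, s in scores_by_group.items() if s < cutoff}


def keep_for_cafe(counts, max_missing=1, max_copies=100):
    """Apply the gene-count filter to one orthologous group.

    counts holds the number of sequences per genome. A group is kept when
    at most one genome has no sequences and no genome has more than 100.
    """
    missing = sum(1 for c in counts if c == 0)
    return missing <= max_missing and max(counts) <= max_copies


if __name__ == "__main__":
    toy_scores = {"og_a": 1.0, "og_b": 2.0, "og_c": 3.0, "og_d": 4.0, "og_e": 50.0}
    print(sorted(drop_outlier_trees(toy_scores)))
    print(keep_for_cafe([3, 0, 5, 2]), keep_for_cafe([0, 0, 4, 1]))
```

Because an orthogroup was eliminated if its tree was an outlier for either measure, a filter like `drop_outlier_trees` would be applied once per measure and only the orthogroups surviving both would be kept.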

Fig. 1 Ultrametric trees used for CAFE analyses. Freshwater species are indicated by bold type and asterisks (*). a The tree resulting from an unconstrained RAxML analysis. The other trees were constrained to match topologies found in b Kocot et al. (2017), c Laumer et al. (2015), and d Marlétaz et al. (2019)

To identify protein functions that were overrepresented in the gene families identified by the CAFE analysis as having expanded or contracted along the freshwater lineages, we performed GO (Gene Ontology) enrichment analyses. GO annotations for H. robusta were downloaded from the UniProt database (Consortium 2018). GO terms were assigned to each orthogroup based on the GO IDs of H. robusta sequences present in each orthogroup. GO enrichment analyses were performed in topGO (Alexa and Rahnenfuhrer 2019), a Bioconductor package for R, to test for overrepresentation of GO terms in the gene families that experienced expansions or contractions in freshwater lineages. We used all orthogroups included in the CAFE analyses as our gene universe. Our set of interest consisted only of those orthogroups that each CAFE analysis identified as (1) having a significant change in gene family size, where (2) at least one of the freshwater lineages showed a significant expansion or contraction, and (3) none of the marine lineages showed a significant expansion or contraction. Analyses were done for both the Molecular Function GO set and the Biological Process GO set using the Parent–child algorithm (Grossmann et al. 2007) with the Fisher’s test statistic.
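The logic of the overrepresentation test can be illustrated with a short sketch. The Python below is not the topGO implementation used in the study; it is a minimal one-sided Fisher's exact test (a hypergeometric upper tail) for a single GO term. The optional `parent_groups` argument is our simplified reading of the Parent–child idea, in which the comparison is restricted to orthogroups annotated to the term's parents. All function names and the toy orthogroup identifiers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n groups are drawn from N, of which K carry the term."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enrichment_p(term_groups, interest, universe, parent_groups=None):
    """One-sided Fisher's exact test p value for one GO term.

    term_groups: orthogroups annotated to the term
    interest:    orthogroups in the set of interest
    universe:    all orthogroups tested (the gene universe)
    parent_groups (optional, simplified assumption): orthogroups annotated
        to the term's parent terms; when given, the universe and the set of
        interest are restricted to them before testing.
    """
    if parent_groups is not None:
        universe = universe & parent_groups
    interest = interest & universe
    term = term_groups & universe
    overlap = len(term & interest)
    return hypergeom_upper_tail(overlap, len(term), len(interest), len(universe))


if __name__ == "__main__":
    # Toy data only; identifiers are invented for illustration.
    universe = {f"OG{i}" for i in range(10)}
    interest = {"OG0", "OG1", "OG2"}
    term = {"OG0", "OG1", "OG2"}
    parents = {"OG0", "OG1", "OG2", "OG3", "OG4"}
    print(enrichment_p(term, interest, universe))
    print(enrichment_p(term, interest, universe, parent_groups=parents))
```

Restricting the universe to the parents' annotations makes the test harder to pass when a term is significant only because its parent is, which is the dependency that Parent–child approaches are meant to address.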

Phylogenetic ANOVAs (Garland et al. 1993) comparing gene counts between freshwater and marine species were performed using the phylANOVA function in the phytools package (Harmon et al. 2008; Revell 2012) in R for each of the 64 OrthoMCL groups identified as groups of interest in the GO enrichment analyses. The unconstrained, ultrametric tree described above was used for the phylogenetic tree and 100,000 simulations were performed for each analysis.
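The idea behind a simulation-based phylogenetic ANOVA can be sketched as follows. This Python is not the phylANOVA code used in the study: it is a minimal illustration that computes an ordinary one-way F statistic and compares it with F statistics from traits simulated under Brownian motion on the tree. The toy tree, tip names, copy counts, and the `(hits + 1) / (nsim + 1)` p value convention are all assumptions made for the example.

```python
import random
from math import sqrt

# Hypothetical rooted tree as (node, parent, branch_length); parents come first.
TOY_TREE = [
    ("root", None, 0.0),
    ("n1", "root", 1.0),
    ("n2", "root", 1.0),
    ("fw_a", "n1", 1.0),
    ("mar_a", "n1", 1.0),
    ("fw_b", "n2", 1.0),
    ("mar_b", "n2", 1.0),
]


def tips(tree):
    """Return nodes that are never listed as a parent."""
    parents = {parent for _, parent, _ in tree}
    return [node for node, _, _ in tree if node not in parents]


def simulate_bm(tree, rng):
    """Simulate one Brownian-motion trait; variance grows with branch length."""
    values = {}
    for node, parent, length in tree:
        if parent is None:
            values[node] = 0.0
        else:
            values[node] = values[parent] + rng.gauss(0.0, sqrt(length))
    return {tip: values[tip] for tip in tips(tree)}


def f_statistic(trait, habitat):
    """One-way ANOVA F statistic for trait values grouped by habitat."""
    groups = {}
    for tip, value in trait.items():
        groups.setdefault(habitat[tip], []).append(value)
    n = len(trait)
    grand = sum(trait.values()) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if ssw == 0:
        return float("inf")
    return (ssb / (len(groups) - 1)) / (ssw / (n - len(groups)))


def phylo_anova(tree, trait, habitat, nsim=1000, seed=0):
    """Return the observed F and a simulation-based p value."""
    rng = random.Random(seed)
    f_obs = f_statistic(trait, habitat)
    hits = sum(f_statistic(simulate_bm(tree, rng), habitat) >= f_obs for _ in range(nsim))
    return f_obs, (hits + 1) / (nsim + 1)


if __name__ == "__main__":
    counts = {"fw_a": 5, "fw_b": 4, "mar_a": 1, "mar_b": 2}  # invented copy numbers
    habitat = {"fw_a": "fresh", "fw_b": "fresh", "mar_a": "marine", "mar_b": "marine"}
    print(phylo_anova(TOY_TREE, counts, habitat))
```

The F statistic does not change when the trait is rescaled, so the simulation can use a unit rate of evolution without affecting the comparison.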

Results

The gene sets we used represent the seven largest spiralian phyla and include freshwater representatives of mollusks, annelids, rotifers, and platyhelminths. HaMStR assigned our protein sequences to a total of 1029 orthologous groups, 923 of which fulfilled our filtering criteria and were concatenated to make the data matrix and used to infer one unconstrained and three constrained ultrametric trees (the latter constrained to match recent hypotheses of relationships among these taxa) (Fig. 1).

OrthoMCL placed our amino acid sequences into 16,702 orthogroups, and 3796 of those met our criteria to be used in the CAFE analyses. Separate CAFE analyses were performed for each species phylogeny. Sixty-four of the orthogroups met our criterion of only showing significant expansions or contractions along one or more freshwater lineages in each analysis. Those 64 orthogroups were used as our groups of interest for the GO enrichment analyses, with the 3796 groups tested in the CAFE analyses comprising our entire gene universe. The Molecular Function analysis was able to use 2407 of the 3796 orthogroups and 44 of the 64 significant groups. The Biological Process analysis was able to use 2130 of the 3796 groups and 39 of the 64 significant groups. The top-scoring nodes for the Biological Process ontology are all related to transmembrane transport: regulation of transmembrane transport (GO:0034762), regulation of ion transmembrane transport (GO:0034765), and regulation of ion transport (GO:0043269) are among the top five results. For Molecular Function, transporter activity (GO:0005215) is the top result, followed by several GO terms associated with voltage-gated channels (Table 2, Fig. 2).
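The rule that selects the groups of interest (a significant family-wide change, at least one significant freshwater lineage, no significant marine lineage, in every topology) can be written as a short sketch. The Python below does not parse CAFE output; the input layout, the function names, and the 0.05 threshold are assumptions made for illustration. Only the four freshwater species names are taken from the text.

```python
FRESHWATER = {
    "Helobdella robusta",
    "Biomphalaria glabrata",
    "Adineta vaga",
    "Schmidtea mediterranea",
}


def meets_criteria(family_p, significant_lineages, alpha=0.05):
    """Check the three criteria for one orthogroup in one CAFE analysis.

    family_p: family-wide p value (hypothetical field)
    significant_lineages: lineages with a significant expansion or contraction
    alpha: assumed significance threshold; the text does not state one here
    """
    if family_p >= alpha:
        return False
    sig = set(significant_lineages)
    has_freshwater = bool(sig & FRESHWATER)
    has_marine = bool(sig - FRESHWATER)  # anything not freshwater counts as marine
    return has_freshwater and not has_marine


def groups_of_interest(results_by_topology, alpha=0.05):
    """Intersect the passing orthogroups across all topologies.

    results_by_topology is a list with one hypothetical mapping per tree:
    {orthogroup_id: (family_p, significant_lineages)}.
    """
    selected = None
    for results in results_by_topology:
        passing = {
            og for og, (p, lineages) in results.items()
            if meets_criteria(p, lineages, alpha)
        }
        selected = passing if selected is None else selected & passing
    return selected or set()


if __name__ == "__main__":
    # Invented results for two topologies.
    topo_a = {
        "OG_x": (0.01, {"Helobdella robusta"}),
        "OG_y": (0.01, {"Adineta vaga", "Phoronis australis"}),
    }
    topo_b = {
        "OG_x": (0.02, {"Adineta vaga"}),
        "OG_y": (0.01, {"Adineta vaga"}),
    }
    print(groups_of_interest([topo_a, topo_b]))
```

Requiring a group to pass under every topology is what makes the selection conservative: a group that passes under only some of the trees is dropped.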

Table 2 Top-scoring GO nodes for the biological process and the molecular function analyses
Fig. 2 The subgraphs that include the top 5 GO terms identified by the Parent–child algorithm for the Biological Process ontology (a) or the Molecular Function ontology (b). Rectangles indicate the top 5 terms. Each node contains the GO identifier, GO name, p value, and the number of significant groups over the total number of groups annotated to that GO term. Groups with a p value above 0.05 are shown in yellow (lighter shading in grayscale), while groups with a p value below 0.05 are shown in pink (darker shading in grayscale)

Among the top results from the ANOVAs are ion channels, ATPases, and protein kinases (Supplementary Table 1). The F statistic, p value, and FDR for each of the 12 orthogroups that had an FDR less than 0.05 are listed in Table 3, along with keywords and Pfam domains for each group. Boxplots showing the number of gene copies in marine species and the number of gene copies in freshwater species for each of those 12 groups are shown in Fig. 3.
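The text reports an FDR for each orthogroup but does not name the correction procedure. As an illustration of how p values from many tests are commonly converted to FDR-adjusted values, the sketch below assumes the Benjamini-Hochberg procedure; that choice, the function name, and the example p values are our assumptions and not details from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order.

    Assumption: the FDR in the text may have been computed differently;
    this is the standard step-up adjustment, shown for illustration.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    toy_p = [0.01, 0.04, 0.03, 0.005]  # invented p values
    for p, q in zip(toy_p, benjamini_hochberg(toy_p)):
        print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

Under this procedure, a group would be reported at the 0.05 level when its adjusted value falls below 0.05.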

Table 3 Top 12 scoring OrthoMCL orthogroups from the ANOVA analyses comparing gene copy number between marine and freshwater taxa for each orthogroup
Fig. 3 Boxplots showing number of gene copies in freshwater taxa versus marine taxa for each of the 12 OrthoMCL orthogroups that were found to have the lowest p values according to the ANOVA analyses

After analyses were complete, we evaluated all of the sequences in the 64 orthogroups identified as groups of interest for evidence of contamination (see supplemental materials for details). We found 55 suspicious sequences according to our criteria out of the 4059 sequences in the set of 64 orthogroups of interest. Of the top 12 orthogroups, only two (OG5_126685 and OG5_127131) contained any suspicious sequences. We conducted phylogenetic ANOVAs with those sequences removed, and both orthogroups were still in the top 12 groups based on p value.

Discussion

These analyses support the hypothesis that gene duplications have played an important role in the adaptation to freshwater environments. The GO enrichment analyses identified protein functions that are overrepresented in the duplicated genes of the freshwater taxa, whether the signal comes from expansions of the same gene families in each lineage or from different gene families that produce proteins with similar functions. In addition, we identified specific gene families that experienced independent expansions along each freshwater lineage. This set of gene families provides a starting point for further investigation into the genomic changes involved in adaptation to freshwater environments. Similar patterns of expansion of these families may be present in all freshwater animal lineages. While we were only able to find suitable data sets for four freshwater spiralian species, they represent four independent colonizations of freshwater environments. Our analyses show convergent expansions of several gene families in each of these taxa, and we would expect to see a similar pattern in other freshwater spiralian genomes.

In an effort to reduce potential false positives, we were more conservative in our analyses than we likely needed to be. To account for phylogenetic uncertainty, we analyzed only those gene families that underwent a significant expansion or contraction along one or more freshwater lineages but no marine lineages across four alternative topologies. This left us with about half as many gene families to test as using any single topology would have, but these families are the ones with the strongest signal. We also believe our results are unlikely to be biased by potential sequencing errors or contamination. While gene set contamination could increase the noise in our data sets, in order for such contamination to yield false positives, the contamination would have to be strongly biased toward certain gene families and only in freshwater taxa. Though we found some sequences in our data that we believe may be contaminants, removing those sequences did not change the top 12 orthogroups (Table 3). Such stringency comes at a cost, though. While we have confidence in the orthogroups we identified as likely involved in the transition to freshwater environments, we recognize that we are probably excluding many other groups that were involved as well. These ANOVA results will exclude any groups that were involved in the freshwater adaptation of only a single taxon. We also do not attempt here to address what types of regulatory or structural changes might be occurring in the proteins involved.

Although the CAFE analyses identify gene families that experience either expansions or contractions, the majority of gene families that met our criteria showed expansions along freshwater lineages, and all of the families that showed a strong signal in our phylogenetic ANOVAs had more gene copies in the freshwater species than in the marine species.

Using the set of gene families we identified with the CAFE analyses as our genes of interest, we performed GO enrichment analyses to identify any GO terms that are disproportionately represented in the gene families we analyzed. This identifies which protein functions were overrepresented among the gene families as well as whether the same families or different families with similar functions changed across the freshwater lineages. Our GO enrichment analyses for both the Molecular Function ontology and the Biological Process ontology indicate the strongest signals are all terms having to do with ion transport (Table 2, Fig. 2).

More interesting, perhaps, than GO term enrichment analyses is comparing individual gene families between habitat types. Phylogenetic ANOVAs comparing gene family sizes between freshwater and marine species found evidence of a difference between marine and freshwater species in several gene families. In all of the families with the strongest signal there was an increase in gene copy number in the freshwater group (Fig. 3). These gene families may be of most interest because this suggests repeated, independent expansions of those gene families occurred in each freshwater lineage. Among these groups is the sodium–potassium ATPase family (OG5_127003), which is of particular interest as it is the target of most gene expression studies investigating transitions between marine and freshwater environments. These genes have been shown to be upregulated in euryhaline animals during salinity changes (Kang et al. 2008; Lee et al. 2011; Havird et al. 2013), and the expansion of this gene family in the H. robusta lineage was shown to have coincided with a freshwater radiation (Horn et al. 2019). Upregulation of these genes associated with changes in environmental salinity suggests that increased gene dosage could be an advantage conferred by the increased copy number seen in the freshwater taxa.

Also among the gene families showing the most pronounced difference between freshwater and marine groups is a voltage-gated potassium channel (OG5_127659). Potassium channel proteins have been shown to be involved in cellular osmoregulation (Xu et al. 2016), and voltage-gated channels are an important part of cellular ion regulation, as they are triggered by the electrical differential across the membrane caused by an ionic differential. This particular family appears to have increased in size in each of the freshwater taxa examined (Fig. 3) and should be investigated in other freshwater taxa as well.

Clearly, gene duplication played a significant role in the colonization of freshwater habitat by these spiralian animals, and the expansion of certain gene families appears to be necessary for the colonization of freshwater habitats. The exact nature of the role those duplicated genes played in the adaptation to freshwater habitats is still unclear. Euryhaline animals tend to experience an upregulation of the sodium–potassium pump genes during salinity changes (Kang et al. 2008; Lee et al. 2011; Havird et al. 2013). Duplications of that gene in freshwater taxa, as we have found here (OG5_127003), could provide an increase in protein dosage. A similar increase in protein dosage may help explain the role of the duplications we see in other ion transport gene families. It is also possible that duplicate copies provided redundancy for selection to act on, allowing the evolution of new regulatory patterns. The timing of these gene duplications, in relation to the freshwater invasions, may also provide important insights into the evolutionary processes involved in the colonization of freshwater environments, particularly in groups such as gastropod mollusks in which there were many independent successful freshwater colonizations. Did each freshwater lineage experience independent gene family expansions, or did at least some of those expansions occur earlier, making the group as a whole better able to adapt to freshwater environments? While we are unable to address those questions with these data, this study provides a list of gene families to begin more specific investigations within these taxa. It is not our intention to suggest that these results represent an exhaustive list of genes involved in the transition to freshwater habitats, or necessarily to explain how these gene duplications contributed to those transitions.
Rather, we hope to provide a common set of gene families that we can assert with a high degree of confidence were involved in the transition from marine to freshwater habitats in every freshwater lineage we examined in Spiralia. We detected a dozen gene families that show significant independent expansions in each of the freshwater lineages, and this is surely a low estimate given our conservative approach. This suggests that the transition to freshwater habitats is not a genetically trivial endeavor. Our work here provides a starting point to begin investigating the exact nature of the role these duplications played, and to begin to look at which adaptations to freshwater environments are lineage specific and which are more common, or even universal.