Introduction

The most popular mammalian model organisms currently in use are the mouse (Mus musculus, NCBI Taxon ID 10092) and the rat (Rattus norvegicus, NCBI Taxon ID 10116), which together accounted for over 73% of all vertebrate animals used in research in Europe in 2017 (61% and 12%, respectively) (EUR-Lex 52020DC0016). Both have a rich history in the research community. The rat was the first animal domesticated for research purposes and has been used since 1856 (Philipeaux 1856; Modlinska and Pisula 2020), while mice followed in the early twentieth century (Morse 1981). Both species have their own set of (dis)advantages: mice are small, easy to breed and cost-effective to keep, while the larger size of rats allows more flexibility in the assays that can be performed (e.g., multiple samples, higher resolution imaging) (Bergman et al. 2000). Rats are more often used than mice in cardiovascular research and neurobiology (Ellenbroek and Youn 2016). The genomes of both species have been sequenced, the mouse in 2002 (C57BL/6J as reference strain) (Mouse Genome Sequencing Consortium et al. 2002) and the rat in 2004 (BN/SsNHsd as reference strain) (Gibbs et al. 2004), as part of a large-scale effort to sequence the human genome and the genomes of important model organisms. The publication of these genome sequences marked the beginning of a new era of research into the genetics and genomics of these and other species, including functional, evolutionary and comparative studies.

About a decade after the first version of each reference genome was published, the widespread adoption and advancement of novel massively parallel sequencing technologies (e.g., the Illumina sequencing platforms) made genome sequencing faster and less expensive and allowed for the sequencing of the genomes of multiple inbred strains of mice and rats. The discovery of sequence variations on a genome-wide level and the investigation of their effects has hence become possible. The mouse genomes project (MGP), launched and performed by the Sanger Institute, sequenced the most used and popular mouse strains, releasing a first set of strain-specific variants in 2011 (Keane et al. 2011). Currently their dataset includes SNPs and small indels from 36 inbred mouse lines. Based on data listed on their FTP server, the MGP is performing additional sequencing on known lines and will include additional lines, but at the time of writing these data have not yet been published. In contrast to mouse research, which is performed almost exclusively in inbred lines, the rat community also makes use of outbred stocks, which have an undefined genetic background (Sharp et al. 2013). The interest in inbred line characterization for rats was therefore traditionally lower. It was performed in two main publications, one by Atanur et al. (2013) including 28 inbred rat strains and a more expanded analysis of 40 lines (including the previous 28) by Hermsen et al. (2015), and the data are made available through the rat genome database (Smith et al. 2020). All information on small variants in mice and rats can be downloaded from the respective databases (MGP and rat genome database) in variant call format, annotated with the Ensembl variant effect predictor (VEP), or accessed directly through web-based variant browsers.

As information that is specific to the coding sequence and to changes in the amino acid sequence cannot be derived trivially from the variant data, our lab has previously processed coding variants, on a per codon basis, in mouse and rat and made these available alongside PROVEAN-based predictions of the functional impact on the variant protein, as the “Mousepost” (Timmermans et al. 2017) and “Ratpost” (Timmermans and Libert 2021) databases, respectively. These online databases offer protein level information that complements the existing nucleotide level resources available at the mouse genomes project and the rat genome database.

In this study, we compare the data that are present in the “Mousepost” and “Ratpost” databases. We investigate if the variants in mouse are also found in rat and vice versa, and if inbred lines from these species can be used interchangeably to some extent or if the variants found are species-specific and/or if the databases provide two complementary datasets that can be used in research.

Results and discussion

Linking mouse and rat variant transcripts

We used the mouse–rat ortholog gene information from the Ensembl website to obtain orthologous gene peptide IDs, which were mapped back to the related transcript IDs. In case of multiple options, only the best match was retained. We filtered both the “Mousepost” and “Ratpost” databases on the occurrence of orthologous pairs and identified 8154 orthologous pairs that had at least one non-synonymous variant in both the rat and the mouse protein sequence, out of 18,788 total pairs. However, the mouse dataset contains several recently wild-derived inbred lines. These are highly divergent from the C57BL/6J reference strain, which has been bred in captivity for about a century. This level of variation could skew the results because these lines contain a high number of private (strain-specific) variants. Comparing transcripts instead of the actual variants partially compensates for this, but to estimate the size of the effect we repeated the comparison with the four most divergent wild-derived lines removed (SPRET/EiJ, PWK/PhJ, CAST/EiJ and MOLF/EiJ), each of which contains thousands more variants than any of the other strains. This lowers the overlap only slightly, from 8154 to 7659 transcripts. As this rules out an overly large bias introduced by the wild-derived strains, we retained them in our analyses. The set of variant proteins in rat is almost a subset of the mouse set. A total of 18,788 mouse coding transcripts in the orthologous set have at least one variant, whereas in rat only 8633 sequences in the orthologous set deviate from the reference sequence. Thus 94.5% of these rat sequences are shared with mouse; however, this does not mean that these sequences are mutated in a similar manner in mouse and rat, only that they deviate from the reference sequence in both species. It is also interesting to place this in the perspective of the total number of rat variants (Fig. 1).

Fig. 1

Overlap between mouse and rat orthologous protein coding transcript sequences that have at least one sequence variation in at least one rat and one mouse inbred strain. The mouse data have more variants called than the rat data. Almost all rat sequences (8154) with a variant in one inbred line have an ortholog in mouse that is also mutated in at least one inbred strain

In our previous “Ratpost” publication we described a total of 12,172 protein coding transcripts with at least one non-synonymous variant (Timmermans and Libert 2021). Taking into account that 8154 transcripts were found to have a variant in at least one inbred line of both species, and that 479 transcripts only have variants in rat, this means that 3539 rat transcripts have no assigned ortholog in mouse, either because none is known, or because the gene is part of a many-to-many orthology relationship, where one rat gene is orthologous to multiple mouse genes and vice versa. Upon further investigation there were no large pathways or groups in this set, but several genes were found to belong to taste and smell receptor families (Olfr and Vmn1r).
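The bookkeeping in the paragraph above can be checked with a few lines of arithmetic. This is a minimal sketch that only uses the counts reported in the text; the variable names are illustrative and nothing is recomputed from the databases.

```python
# Counts as reported in the text (not recomputed from Mousepost/Ratpost).
RAT_VARIANT_TRANSCRIPTS = 12172   # rat transcripts with >= 1 non-synonymous variant
SHARED_WITH_MOUSE = 8154          # ortholog pairs with a variant in both species
RAT_ONLY_IN_ORTHOLOG_SET = 479    # rat variant transcripts whose mouse ortholog has none

# Rat variant transcripts that are part of the one-to-one ortholog set.
rat_in_ortholog_set = SHARED_WITH_MOUSE + RAT_ONLY_IN_ORTHOLOG_SET

# Rat variant transcripts that have no assigned mouse ortholog.
rat_without_ortholog = RAT_VARIANT_TRANSCRIPTS - rat_in_ortholog_set

# Fraction of rat variant transcripts in the ortholog set that is shared with mouse.
shared_fraction = SHARED_WITH_MOUSE / rat_in_ortholog_set

print(rat_in_ortholog_set)        # 8633
print(rat_without_ortholog)       # 3539
print(f"{shared_fraction:.1%}")   # 94.5%
```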

The vomeronasal, olfactory and MHC gene families are known for presence–absence polymorphism and lower than normal reliability in variant calling (Ibarra-Soria et al. 2017). In this comparison, as well as in Mousepost and Ratpost, only members that are present in a strain are included. In case of a gene deletion in a strain, the locus is not further investigated; in effect this is the same approach as for a locus that contains no missense or nonsense variants. As we seek to provide an overview that is as complete as possible, these gene families were not excluded, but users should be careful with data from these gene families.

The large difference between mouse and rat in the number of variants and variant-containing sequences can likely be attributed to the sequencing efforts performed, which are directly related to the respective use of each species in the research community. Mouse-specific sequencing efforts are headed by a large institute (the Sanger Institute), and additional sequencing and resequencing is being done, giving higher coverage for variant calling. By contrast, in rat there have been only limited strain-specific sequencing efforts so far, and should more sequencing data become available it is likely that the number of variants, and of transcripts with variants, will increase. In addition, only the most used strains have been sequenced, which is only a fraction of the total available inbred strains (hundreds in both rat and mouse) (Sharp et al. 2013; Beck et al. 2000), and these numbers may change if and when data from other strains become available.

Overlap between variant classes for orthologous sequences

When taking into account the different variant classes resulting from small indels and SNPs that were defined in the “Mousepost” and “Ratpost” databases, namely stop-gain (SG), stop-loss (SL), and non-stop-related mutation (MUT), large differences between mouse and rat variants become readily apparent. Especially for the transcripts that are SG or SL, the overlap between rat and mouse variants is very limited at the classification level, with only 42 transcripts found that carry a nonsense mutation in both species, out of the 684 rat (6.1%) and 374 mouse (11.2%) transcripts in that category (Table 1). The same observation can be made for SL: only eight transcripts are annotated as SL in at least one strain of both species, which is 11.1% for rat and a mere 0.65% for mouse (Table 1). This shows that there is only (very) limited overlap between the variant classes of mouse and rat. It will be difficult to find conditions where there is genetic equivalence between a mouse and a rat strain for a specific transcript or gene. However, this also provides a large natural repository of multiple variants that all result in loss of function to some degree. We make a detailed comparison of the transcripts that contain a SNP or indel in both mouse and rat inbred strains and compare the variants they contain, with a focus on those that make up a missense event (MUT) in at least one strain of one or both species.
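The class-level overlap described above is a set intersection followed by a percentage per species. The sketch below illustrates the idea; the function names, the dictionary layout and the toy ortholog-pair IDs are assumptions made for illustration, and only the stop-gain counts (42, 684, 374) come from the text.

```python
def shared_in_class(mouse_classes, rat_classes, variant_class):
    """Ortholog pairs annotated with variant_class in at least one strain of both species.

    Both arguments map a (hypothetical) ortholog-pair ID to the set of classes
    ("SG", "SL", "MUT") observed for that transcript across all strains.
    """
    mouse_hits = {pair for pair, classes in mouse_classes.items() if variant_class in classes}
    rat_hits = {pair for pair, classes in rat_classes.items() if variant_class in classes}
    return mouse_hits & rat_hits


def percent_shared(shared, total):
    """Percentage of one species' transcripts in a class that is shared with the other."""
    return round(100 * shared / total, 1)


# Toy annotation with made-up pair IDs.
toy_mouse = {"pair1": {"SG", "MUT"}, "pair2": {"MUT"}, "pair3": {"SL"}}
toy_rat = {"pair1": {"SG"}, "pair2": {"SG"}, "pair3": {"MUT"}}
print(shared_in_class(toy_mouse, toy_rat, "SG"))  # {'pair1'}

# Stop-gain counts reported in the text.
print(percent_shared(42, 684))  # 6.1  (rat)
print(percent_shared(42, 374))  # 11.2 (mouse)
```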

Table 1 Number of transcripts assigned in each variant class in mouse and rat

Comparing variant positions

For transcripts that have conserved SG or SL variants between species, comparing the size of the truncation or sequence gained is straightforward and was performed in an indirect manner by means of protein length ratios obtained by comparing the strain-specific sequence to the respective reference sequence. This compensates directly for variations in length caused by the evolutionary difference between species.
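A length ratio of this kind could be computed as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, the sequences are toy examples, and stripping a trailing `*` stop symbol is an assumption about how the protein strings might be stored.

```python
def length_ratio(strain_protein, reference_protein):
    """Length of the strain-specific protein relative to its own species' reference.

    A ratio below 1 indicates a truncation (stop-gain); a ratio above 1
    indicates an extension (stop-loss).
    """
    strain = strain_protein.rstrip("*")
    reference = reference_protein.rstrip("*")
    if not reference:
        raise ValueError("reference protein sequence is empty")
    return len(strain) / len(reference)


# Toy sequences: the references differ in length between species, but the
# ratios are directly comparable because each is relative to its own reference.
mouse_ratio = length_ratio("MKTAY", "MKTAYIAKQR")      # 5 / 10
rat_ratio = length_ratio("MKSAYL", "MKSAYLAKQRLV")     # 6 / 12
print(mouse_ratio, rat_ratio)  # 0.5 0.5
```

Because each ratio is taken against the reference of the same species, a truncation to half the protein gives 0.5 in both mouse and rat even when the orthologs differ in length.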

In contrast, comparing positions and substitutions caused by missense mutations (class MUT) was non-trivial due to the differences caused by evolution since speciation. In order to perform comparisons, the reference sequences of each MUT protein were pairwise aligned using Biopython to obtain the optimal global alignment for each orthologous pair. Results from this alignment step were used to create a one-to-one map of mouse vs rat protein positions. Variants were queried from the “Mousepost” and “Ratpost” databases and compared using this positional map on a per transcript basis. For the MUT transcripts there were a total of 74,964 variant positions. Only 619 of these were found in both mouse and rat, while 14,599 were found only in rat and 59,746 were found only in mouse, without a rat equivalent (Fig. 2).
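The positional map and the three-way split of variant positions can be illustrated with a short sketch. It assumes the alignment is already available as two gapped strings of equal length (as any global aligner can provide); the function names and toy sequences are hypothetical and this is not the authors' code.

```python
def position_map(aligned_mouse, aligned_rat, gap="-"):
    """Map 1-based mouse protein positions to 1-based rat positions.

    Only alignment columns where both sequences have a residue are mapped;
    positions opposite a gap have no equivalent in the other species.
    """
    if len(aligned_mouse) != len(aligned_rat):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    mouse_pos = rat_pos = 0
    for mouse_res, rat_res in zip(aligned_mouse, aligned_rat):
        if mouse_res != gap:
            mouse_pos += 1
        if rat_res != gap:
            rat_pos += 1
        if mouse_res != gap and rat_res != gap:
            mapping[mouse_pos] = rat_pos
    return mapping


def partition_positions(mouse_positions, rat_positions, mouse_to_rat):
    """Split variant positions into shared (mouse coordinates), mouse-only and rat-only."""
    shared = {p for p in mouse_positions if mouse_to_rat.get(p) in rat_positions}
    mouse_only = set(mouse_positions) - shared
    rat_only = set(rat_positions) - {mouse_to_rat[p] for p in shared}
    return shared, mouse_only, rat_only


# Toy alignment: the rat sequence lacks the residue at mouse position 3.
mouse_to_rat = position_map("MKTAYIAK", "MK-AYLAK")
shared, mouse_only, rat_only = partition_positions({4, 6}, {3, 7}, mouse_to_rat)
print(mouse_to_rat)                  # {1: 1, 2: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}
print(shared, mouse_only, rat_only)  # {4} {6} {7}
```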

Fig. 2

Variants found in mouse–rat orthologs. Only 619 variants occur at equivalent positions in rat and mouse; the large majority of variants in inbred lines are species-specific

A large portion of these variants may result in loss of function (LOF) mutations. Overall, LOF variants can have two main origins: they can be present through natural genetic divergence between species, but they can also be the result of relaxed evolutionary pressure on genes that are less important in captivity. The latter group will occur in all species kept in captivity, and thus should be enriched in the mouse–rat overlap.

These data suggest that the mouse and rat variant databases contain a large amount of complementary information. This increases the chance that at least one of the two model organisms will have a natural variant that matches a human variant. The “Ratpost” and “Mousepost” databases can be used to find such variants for human pathological mutations, such as those described in the ClinVar database (Landrum et al. 2018), on the condition that they are protein coding. Furthermore, it also indicates that some studies must be done, or are much easier to do, in a species-specific manner that leverages existing variation between strains. An example of this is the LPS resistance observed in the C3H/HeJ mouse strain, which was found to be caused by a mutation in the Tlr4 gene (Poltorak et al. 1998) and can also be found in “Mousepost”. Since this gene is not found to have a protein level mutation in any of the rat strains, it would not have been possible to find the association between Tlr4 and LPS resistance using the rat model organism without performing active (random) mutagenesis.

Non-position equivalent variants

The large majority of mouse and rat variant transcripts do not share a position-specific variation. This does not mean that the protein suffers from loss of function in the inbred lines of one species only. In many cases, the proteins have completely different mutations that result in the same outcome: loss of function. A clear example is found in the tyrosinase protein, in which loss of function results in albinism. Several inbred lines in the mouse have the albino phenotype, related to an amino acid substitution at position 103 (cysteine loss: C103S, predicted deleterious: PROVEAN score = − 9.74) (Yokoyama et al. 1990). All mouse lines with a mutation in the Tyr gene have this variant and are albino. In the rat there are also many inbred strains with an albino phenotype and the Tyr gene is also mutated in those. However, none of the rat strains have a C103S variant and indeed all match the rat reference sequence at that position, a cysteine (Supplemental Fig. 1). Instead, the albino rat strains share a different mutation at position 299 (R299H, predicted deleterious: PROVEAN score = − 3.21). This variant at position 299 is rat-specific, and was shown to cause albinism in F344 in a previous study (Lu et al. 2007), and all mouse strains have the same sequence here as the rat reference strain: an arginine (Supplemental Fig. 2).

Position equivalent variants

For the 619 variants that occur at equivalent positions in mouse and rat, we found four different types (Fig. 3). All variant possibilities were compared, so for example if a rat variant has two mouse variants (in different strains) at the same position, it is included twice. There are a total of 120 variants (group i) that are completely identical in occurrence in mouse and rat, meaning that they have the same reference amino acid (AA), which is changed to the same alternative residue in some strains (e.g., a V to D in both). One such example is the Uspl1 gene, which has a shared A164L substitution in rat and mouse that is not predicted to be deleterious in either species. Slightly fewer variant positions, 109 (group ii), have the same reference AA but a different alternative strain-specific AA (e.g., V to D in rat and V to M in mouse). A practical example is the Sun2 gene, which has a mutation resulting in the replacement of an alanine at position 285. In mouse this results in a threonine (MOLF/EiJ; score: − 2.16), while in rat a glycine is found instead (ACI/EurMcwi; score: − 1.82). The smallest group (group iii) has convergent variants, i.e., the sites have a different reference AA, but the mutation results in the same AA (e.g., D to V in rat and A to V in mouse). An example is found in the Mylk gene, where in mouse there is a V to A substitution at position 465 (MOLF/EiJ; score: 1.62) and the rat equivalent of this position (455) shows a T to A change (SBH/Yg; score: 1.13). Finally, the largest group (group iv) of 334 variants shows no relation between mouse and rat for position equivalent variants.
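The four groups follow from two comparisons per position-equivalent variant pair: whether the reference residues match and whether the alternative residues match. The sketch below encodes that rule; the function name is hypothetical, and the example residues are the ones quoted in the text for Uspl1, Sun2 and Mylk.

```python
def classify_variant_pair(mouse_ref, mouse_alt, rat_ref, rat_alt):
    """Assign a position-equivalent mouse/rat variant pair to group i, ii, iii or iv."""
    same_reference = mouse_ref == rat_ref
    same_alternative = mouse_alt == rat_alt
    if same_reference and same_alternative:
        return "i"    # identical change in both species
    if same_reference:
        return "ii"   # same reference AA, different alternative AA
    if same_alternative:
        return "iii"  # different reference AA, same alternative AA (convergent)
    return "iv"       # neither reference nor alternative AA match


print(classify_variant_pair("A", "L", "A", "L"))  # i   (Uspl1: A to L in both)
print(classify_variant_pair("A", "T", "A", "G"))  # ii  (Sun2: A to T in mouse, A to G in rat)
print(classify_variant_pair("V", "A", "T", "A"))  # iii (Mylk: V to A in mouse, T to A in rat)
print(classify_variant_pair("V", "D", "A", "M"))  # iv  (made-up example)
```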

Fig. 3

Proportion of the different groups of the 619 shared variant positions. Group I, the set of identical inbred line changes in both mouse and rat, is highlighted. Groups: I (identical variants in rat and mouse), II (same reference AA at the position but different substitution in rat and mouse), III (different reference AA at the position, but changed to an identical alternative) and IV (mouse and rat reference as well as variants are different for the position)

This total of 619 shared positions corresponds to 464 distinct protein sequences in each species. A gene set enrichment analysis for pathways and gene ontology (GO) terms using Metascape shows an over-representation of primarily immune process-related functions (Fig. 4).

Fig. 4

Gene set enrichment analysis for overrepresented pathways and functions for the 464 genes containing position conserved variants in mouse and rat inbred strains. The top 20 pathways and functions are enriched for immune-related processes, such as signaling, as well as blood clotting and, to a lesser extent, DNA damage/repair. Gene set enrichment was performed using Metascape and the annotation of the mouse genes of the ortholog pairs

These functions may be related to the environmental pressures present in these rats and mice. Laboratory animals are known to have a somewhat altered immune function compared to their wild counterparts, mainly due to lower exposure to pathogens (Viney et al. 2015; Yeung et al. 2020). This may result in a convergent evolution of genes in immune processes over many generations.

In addition, the variants in group i in particular are potentially very useful: these positions (both the reference and the alternative AA) have either been conserved through evolution or arisen independently twice. Variants with a PROVEAN score below − 2.5 are predicted to have a negative effect on protein function. This value of − 2.5 is the cut-off below which a variant is considered deleterious for protein function, if one accepts the 80% balanced accuracy (same true positive and true negative rate) reported in the PROVEAN manual (Choi et al. 2012). It is possible that other AA substitutions at these positions have a more severe, intolerable effect.
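Applying the cut-off is a single comparison. The sketch below follows the wording of the text (scores below − 2.5 are called deleterious); how a score of exactly − 2.5 is treated is an assumption here, and the function name is hypothetical. The example scores are the ones quoted in this article.

```python
PROVEAN_CUTOFF = -2.5  # cut-off quoted in the text


def is_predicted_deleterious(score, cutoff=PROVEAN_CUTOFF):
    """True when the PROVEAN score falls below the cut-off."""
    return score < cutoff


# Scores quoted in the text.
examples = {
    "Tyr C103S (mouse)": -9.74,
    "Tyr R299H (rat)": -3.21,
    "Sun2 A285T (mouse)": -2.16,
    "Sun2 A285G (rat)": -1.82,
}
for label, score in examples.items():
    print(label, is_predicted_deleterious(score))
```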

Conclusion

We have performed an in-depth comparison of our previously published datasets of protein coding variants in mice (Mousepost) and rats (Ratpost). We show that the set of transcripts affected by a variant in both species shows a very large overlap, in part because the majority of transcripts will have one or more variants if a sufficient number of strains are sequenced. This is not the case for individual variants, which show only minimal overlap. Overall, the “Ratpost” and “Mousepost” databases, as well as the mouse and rat model organisms, are mainly complementary where protein coding variants are concerned. This large repository of natural variations may serve as a useful tool to select a specific inbred strain and species for a (human) disease model.

Materials and methods

Coding sequence variants

We obtained coding sequence variation data from the “Mousepost” and “Ratpost” databases. These data were derived from the mouse genomes project for mouse inbred lines and from the rat genome database for rat inbred strains. The collection of complete protein sequences from the construction of “Ratpost” and “Mousepost” databases was used in this analysis (Timmermans and Libert 2021).

Orthologous genes and transcripts

Information concerning orthologous mouse–rat relationships was downloaded from the Ensembl Biomart webtool (Yates et al. 2020). The data filtering settings were specified as protein coding, orthologous mouse genes only with the ‘mouse homology type’ attribute added. The resulting datafile was filtered to obtain a set of one-to-one relationships.
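Filtering such an export down to one-to-one relationships could look like the sketch below. The column names and gene IDs are made up for illustration (a real BioMart export uses its own headers), while `ortholog_one2one` is the label Ensembl uses for one-to-one homology; the function name is hypothetical.

```python
import csv
import io

# Toy stand-in for a tab-separated BioMart export; headers and IDs are hypothetical.
TOY_EXPORT = (
    "rat_gene\tmouse_gene\tmouse_homology_type\n"
    "RAT_GENE_1\tMOUSE_GENE_1\tortholog_one2one\n"
    "RAT_GENE_2\tMOUSE_GENE_2\tortholog_one2many\n"
    "RAT_GENE_2\tMOUSE_GENE_3\tortholog_one2many\n"
    "RAT_GENE_4\tMOUSE_GENE_4\tortholog_one2one\n"
    "RAT_GENE_5\tMOUSE_GENE_6\tortholog_many2many\n"
)


def one_to_one_pairs(handle):
    """Return (rat_gene, mouse_gene) tuples for rows flagged as one-to-one orthologs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["rat_gene"], row["mouse_gene"])
        for row in reader
        if row["mouse_homology_type"] == "ortholog_one2one"
    ]


pairs = one_to_one_pairs(io.StringIO(TOY_EXPORT))
print(pairs)  # [('RAT_GENE_1', 'MOUSE_GENE_1'), ('RAT_GENE_4', 'MOUSE_GENE_4')]
```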

Sequence alignment and position conversion

We used the Python scripting language (v3.8) for all steps of the analysis. Global pairwise sequence alignments were performed using the Biopython (Cock et al. 2009) align module with the BLOSUM62 substitution matrix, a gap opening penalty of 10 and a gap extension penalty of 0.5. Alignment results were kept in memory and processed into a lookup table of mouse-to-rat (and rat-to-mouse) position matches. The “Mousepost” and “Ratpost” MySQL databases were queried for all variants in the transcript in the form “reference_AA position alternative_AA”. Variant positions were compared based on the position lookup table and the exact AA at the position.
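One way to reproduce this setup with Biopython is sketched below. The text names Biopython, BLOSUM62 and the two gap penalties but not the exact interface, so the use of `PairwiseAligner`, the mapping of the penalties onto `open_gap_score` and `extend_gap_score`, the function names and the toy sequences are all assumptions.

```python
from Bio import Align
from Bio.Align import substitution_matrices


def build_aligner():
    """Global aligner with BLOSUM62, gap opening penalty 10 and extension penalty 0.5."""
    aligner = Align.PairwiseAligner()
    aligner.mode = "global"
    aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
    aligner.open_gap_score = -10     # penalties are given as negative scores
    aligner.extend_gap_score = -0.5
    return aligner


def mouse_to_rat_lookup(mouse_seq, rat_seq, aligner):
    """1-based mouse-to-rat position lookup taken from the first optimal alignment."""
    alignment = aligner.align(mouse_seq, rat_seq)[0]
    mouse_blocks, rat_blocks = alignment.aligned  # ungapped aligned blocks, 0-based
    lookup = {}
    for (m_start, m_end), (r_start, _r_end) in zip(mouse_blocks, rat_blocks):
        for offset in range(int(m_end) - int(m_start)):
            lookup[int(m_start) + offset + 1] = int(r_start) + offset + 1
    return lookup


aligner = build_aligner()
# Toy sequences: the rat sequence lacks the alanine at mouse position 4.
lookup = mouse_to_rat_lookup("MKTAYIAKQR", "MKTYIAKQR", aligner)
print(lookup.get(5))  # rat position aligned to mouse position 5
rat_to_mouse = {rat_pos: mouse_pos for mouse_pos, rat_pos in lookup.items()}
```

Inverting the dictionary gives the rat-to-mouse direction, since every aligned column pairs exactly one residue from each sequence.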

Gene set enrichment

Enrichment analysis on the set of genes from the 619 shared positions was performed using Metascape (Zhou et al. 2019).